Techie By Nature: July 2014

Friday, July 18, 2014

Hadoop Vs RDBMS

Hadoop or RDBMS? There are many different opinions about these two technologies. Some people think Hadoop is a replacement for RDBMS, some say Hadoop is the future, and some think Hadoop is just hype. So I have tried to find out what the reality is and what Hadoop is really meant for.

Let's Talk About the History

RDBMS was developed with the intent of storing and analyzing data. It appeared at a point on the timeline when computation was slow and disks were expensive. The whole idea behind the technology was to store data in a structured (tabular) and logical manner, so that a lot of data could be stored in a small space and, since the size was small, analyzed with very little computational effort. Unstructured data did not get much attention; the IT giants and other systems generally discarded it. RDBMS evolved into a great technology over time, and its capabilities have grown enormously. Even 40-50 years after its origin, RDBMS remains a phenomenal tool for ETL work. But RDBMS has its own limitations.

In the current world, computation is ridiculously fast, disks are ridiculously cheap, and various advanced technologies are producing huge amounts of unstructured data. Developers started thinking beyond the limitations of RDBMS and realized the importance of this huge volume of data that different systems were discarding. That was the moment Hadoop came into existence. It is a technology that deals with unstructured data very efficiently and reliably.

Differences in capabilities

Filtering and Aggregating – RDBMS is good at filtering and aggregating things quickly. If you want to find some specific data in a huge database based on some specific condition, an RDBMS will return the records in a reasonably short time.

Interactive applications – If you have a data warehouse and you are used to flying through your cubes interactively, getting quick answers to questions that are more or less pre-computed, then Hadoop would be a bad choice. On the other hand, RDBMS is bad at problems like “how many, or which, new customers visited your store for the first time ever in the last year?”, where you need to process every row of your database. An RDBMS takes time to work out such counts. The other place where an RDBMS always eats up time is cursors, where you loop through rows one at a time.

Orientation of Data – Hadoop is best suited for raw data and RDBMS for structured data. Hadoop is a “schema on read” technology, whereas RDBMS is a “schema on write” technology: before writing data into a transactional database you need to know things like the structure of the tables and the data types, and you need to convert your data to fit the tables. With Hadoop you just distribute the data across the cluster and then read it in whatever structure you want.
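To make “schema on read” a little more concrete, here is a minimal Python sketch. It is not Hadoop code; the sample log lines, the column names and the `read_with_schema` helper are all made up for illustration. The raw text is stored untouched, and each reader decides at read time which fields it wants and how to type them.

```python
import csv
import io

# Raw data is kept exactly as it arrived; nothing is validated or converted on write.
# (Hypothetical sample records, invented for this sketch.)
RAW_LOG = """2014-07-01,alice,/home,200
2014-07-01,bob,/cart,500
2014-07-02,alice,/checkout,200
"""

def read_with_schema(raw, columns, converters):
    """Apply a schema while reading: name the fields and convert their types."""
    for row in csv.reader(io.StringIO(raw)):
        # zip() stops at the shorter input, so a reader may name only the leading fields.
        yield {name: converters.get(name, str)(value)
               for name, value in zip(columns, row)}

# One reader treats the last field as an integer HTTP status...
visits = list(read_with_schema(
    RAW_LOG, ["day", "user", "page", "status"], {"status": int}))

# ...another reader of the same raw text only cares about who visited.
users = sorted({r["user"] for r in read_with_schema(RAW_LOG, ["day", "user"], {})})

print(visits[1]["status"])  # 500
print(users)                # ['alice', 'bob']
```

A “schema on write” system would instead reject or convert each record at load time, before any query could see it.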

Failures – Hadoop handles the failure of nodes internally. MapReduce jobs do not hold back results when a node fails: because the data is replicated, the delivered results are reliable and consistent with the no-failure scenario. On the other hand, a relational database management system (RDBMS) does not release any results to the user in case of failure, because it cannot guarantee a complete dataset; this all-or-nothing behavior is what two-phase commit enforces.

Scalability – Hadoop solutions can scale to 1,000 servers or even beyond, cheaply and reliably. RDBMS starts giving problems when it needs to scale. Scaling an RDBMS is very expensive, with

Objective of solution – RDBMS and the star schema are fundamentally about transactions. They work well where we are doing financial planning, but they are not really great for marketing, because in market planning we are not really seeking answers about transactions; we are asking questions about people and customers. RDBMS is good for simple things: for whatever fits in a table, RDBMS is a great analysis tool. But when things start to spill out of tables, like customers who are using your website, Hadoop is the better option.

Expert’s Opinion

According to a joint report from Cloudera and Teradata – “Hadoop and the data warehouse will often work together in a single information supply chain. When it comes to Big Data, Hadoop excels in handling raw, unstructured and complex data with vast programming flexibility. Data warehouses also manage big structured data, integrating subject areas and providing interactive performance through BI tools.”

When to Use Which

Requirement	Data Warehouse	Hadoop
Low latency, interactive reports, and OLAP	·
ANSI 2003 SQL compliance is required	·
Preprocessing or exploration of raw unstructured data		·
Online archives alternative to tape		·
High-quality cleansed and consistent data	·
100s to 1000s of concurrent users	·	·
Discover unknown relationships in the data	·	·
Parallel complex process logic		·
CPU intense analysis	·	·
System, users, and data governance	·
Many flexible programming languages running in parallel		·
Unrestricted, ungoverned sand box explorations		·
Analysis of provisional data		·
Extensive security and regulatory compliance	·
Real time data loading and 1 second tactical queries	·	·

Conclusions

Hadoop is qualitatively different from RDBMS systems. Hadoop is good at exhaustive batch processing and deep analysis. If you have an existing enterprise data warehouse problem, Hadoop is going to map poorly onto it. Where Hadoop will really shine is with the new online data sources, if you want to understand them, digest them, and maybe even load a summary of them into your cube. Hadoop is not a replacement for RDBMS; it is a technology that needs to run in parallel with your RDBMS infrastructure. RDBMS is still unequivocally a good tool for running ETL processes, while Hadoop brings an entirely different perspective to the database world and deals with a different kind of problem.

Thursday, July 17, 2014

HDFS Architecture

The Hadoop Distributed File System (HDFS) is a highly fault-tolerant file system designed and optimized to be deployed on a distributed infrastructure built from commodity hardware. HDFS provides high-throughput access to application data and is best suited for applications that have large data sets. Unlike existing distributed file systems, HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally developed as infrastructure for the Apache Nutch web search engine project.

Inside HDFS

HDFS is based on a master-slave architecture. A typical HDFS cluster consists of one NameNode and multiple DataNodes, generally one per node in the cluster. The NameNode is the arbitrator and repository for all HDFS metadata, and the system is designed in such a way that user data never flows through it. The NameNode acts as the master: it manages the file system namespace and regulates access to files by client applications. Each DataNode manages the user data stored on the node it runs on. A file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes from the heartbeats and block reports the DataNodes send. DataNodes serve read and write requests, and perform block creation, deletion, and replication only when the NameNode instructs them to.
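The split between metadata and data can be pictured with a small toy model in Python. This is only a sketch and not the HDFS API: the 128 MB block size, the replication factor of 3, the node names and the simple round-robin placement are assumptions chosen for illustration (real HDFS placement is rack-aware). The point is that the "NameNode" side holds nothing but a map of blocks to DataNodes.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size in bytes (128 MB)
REPLICATION = 3                  # assumed replication factor

def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return NameNode-style metadata: block index -> (length, [DataNode names])."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    block_map = {}
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        # Simplified round-robin placement; each block starts one node further along.
        replicas = [datanodes[(index + i) % len(datanodes)] for i in range(replication)]
        block_map[index] = (length, replicas)
        offset += length
        index += 1
    return block_map

def readable_after_failure(block_map, failed_node):
    """The file stays readable if every block still has a replica on a live node."""
    return all(any(node != failed_node for node in nodes)
               for _, nodes in block_map.values())

nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
meta = plan_blocks(300 * 1024 * 1024, nodes)  # a 300 MB file

print(len(meta))                            # 3 blocks: 128 MB, 128 MB, 44 MB
print(meta[2][1])                           # ['dn3', 'dn4', 'dn1']
print(readable_after_failure(meta, "dn2"))  # True
```

Losing any single node leaves every block with at least two other copies, which is the replication idea the architecture relies on.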

Deployment

NameNode and DataNode are software programs written in Java, so they can run on any machine that has Java installed (typically Unix/Linux) and can be deployed on a wide range of machines. In real-world deployments the NameNode runs on a dedicated high-end machine, and all other machines in the cluster are commodity hardware running the DataNode program. Each cluster normally consists of one NameNode and multiple DataNodes. In some special scenarios there can be multiple NameNodes within a cluster: to scale the name service horizontally, federation uses multiple independent NameNodes that do not require coordination with each other. The DataNodes are used as common storage by all the NameNodes; every DataNode registers with every NameNode and sends periodic heartbeats and block reports to each of them.

Friday, July 11, 2014

Discovery of Hadoop

Back in the early 2000s, Doug Cutting was attempting to build an open source search engine called Nutch. The team had trouble managing their distributed system even when running on very few computers, and Nutch was stuck with its two half-time developers. At the same time Google was facing a problem of its own because it was ingesting the entire internet frequently: it needed to process all the data available on the World Wide Web every couple of days, and it was practically impossible to build an index over the entire internet in a reasonable amount of time using any commercial tool available. They had documents that were web pages, plus the logs they generated themselves. They needed a system that could read through all of that data in time, and they couldn't go and buy a document management system to do the job because none was available. So they designed and built their new infrastructure in-house, which was MapReduce.

MapReduce was a pretty simple idea: you use some commodity servers with some memory and disks attached, each with a reasonable amount of CPU. The data is distributed among all the servers with a safe number of replicas, so that if you lose a server you have another copy of the data somewhere. Now you have stored all your data very cheaply and pretty reliably, and the best part is that you have CPUs attached to the disks. If you want to do some indexing or transformation, you can use the local CPU to chew over the data, and you get huge parallelism in your data processing. You don’t have to funnel all the data through a single processor. This basic idea worked, and it changed the old world where everything was centralized.
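The shape of the idea can be sketched in a few lines of plain Python. This is a single-process toy and not the Hadoop or Google implementation: the three strings stand in for chunks of data sitting on different servers, and the function names are made up for illustration. Each chunk is mapped independently (which is what allows the work to run next to the data), the emitted pairs are grouped by key, and a reduce step combines each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one local chunk of data."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

# Each string stands in for a chunk stored on a different server (invented sample data).
chunks = ["the web is big", "the index is bigger", "big data"]

# In a real cluster every chunk would be mapped on the machine that stores it.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))

print(counts["the"], counts["big"], counts["data"])  # 2 2 1
```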

After Google published its papers on GFS (2003) and MapReduce (2004), the route for Nutch was clear, and they implemented the same ideas. Their system was not perfect, but somehow they managed to run it on 20 machines. Very soon they realized that it wasn’t something two half-timers (Doug Cutting and Mike Cafarella) could handle, because they needed to run their processes on thousands of machines and they needed more people.

Around that time Yahoo got interested in their work; the Yahoo folks were looking into these projects to add more capabilities to their search engine. They left out the search engine part of Nutch, developed the distributed computing part of it, and named it Hadoop.

Thursday, July 10, 2014

How to Find out Next and Previous Day of Week in Oracle

Tips and Tricks

I have seen people write many lines of code to find the date of a specific day in the current or previous week; most often we need Friday, the last working day of the week. There is a pretty simple way to find it in an Oracle query.

Logic:

· Add 1 to the current date and truncate the result to the first day of the week. SYSDATE is already a DATE, so it needs no TO_DATE conversion.

TRUNC(SYSDATE + 1, 'd')

Oracle considers Sunday the first day of the week and Saturday the last (this depends on the session's NLS_TERRITORY setting; Sunday is the first day for territories such as AMERICA). Adding 1 before truncating shifts the week frame one day back, so the week now starts on Saturday and ends on Friday, and TRUNC gives the Sunday inside that frame.

· Now, whatever day you want in the week, just add or subtract the number of days from the Sunday you got, counting Monday as day 1, Tuesday as day 2 and so on. In our case we are looking for Friday, so we add 5, since it is day 5.

TRUNC(SYSDATE + 1, 'd') + 5

So our query will look something like this:

SELECT TRUNC(SYSDATE + 1, 'd') + 5
FROM DUAL;
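If you want to check the arithmetic without an Oracle session, the same logic can be mimicked in Python. This is only a sketch: `trunc_to_sunday` is a made-up helper that imitates truncating a date to the week start under the assumption that weeks begin on Sunday, and the `- 2` variant for the previous Friday is my own extension of the add-or-subtract step, not something Oracle provides by that name.

```python
from datetime import date, timedelta

def trunc_to_sunday(d):
    """Imitate truncating a date to the week start, assuming weeks start on Sunday."""
    days_since_sunday = (d.weekday() + 1) % 7  # Python: Monday=0 ... Sunday=6
    return d - timedelta(days=days_since_sunday)

def next_friday(d):
    """Shift by one day, truncate to Sunday, add 5: the Friday closing the Sat-Fri frame."""
    return trunc_to_sunday(d + timedelta(days=1)) + timedelta(days=5)

def previous_friday(d):
    """Assumed variant: subtract 2 from the same Sunday to get the Friday before the frame."""
    return trunc_to_sunday(d + timedelta(days=1)) - timedelta(days=2)

# Saturday 5 July 2014 through Friday 11 July 2014
week = [date(2014, 7, 5) + timedelta(days=i) for i in range(7)]

print({next_friday(d) for d in week})      # {datetime.date(2014, 7, 11)}
print({previous_friday(d) for d in week})  # {datetime.date(2014, 7, 4)}
```

Every day from Saturday to Friday lands on the same Friday, which is exactly the shifted week frame described above.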

EXAMPLES:

Query for Next Friday (one column per day from Saturday 5 July to Friday 11 July 2014, assuming Sunday is the first day of the week):

SELECT TRUNC(TO_DATE('20140705','YYYYMMDD') + 1, 'd') + 5 AS SAT,
       TRUNC(TO_DATE('20140706','YYYYMMDD') + 1, 'd') + 5 AS SUN,
       TRUNC(TO_DATE('20140707','YYYYMMDD') + 1, 'd') + 5 AS MON,
       TRUNC(TO_DATE('20140708','YYYYMMDD') + 1, 'd') + 5 AS TUE,
       TRUNC(TO_DATE('20140709','YYYYMMDD') + 1, 'd') + 5 AS WED,
       TRUNC(TO_DATE('20140710','YYYYMMDD') + 1, 'd') + 5 AS THU,
       TRUNC(TO_DATE('20140711','YYYYMMDD') + 1, 'd') + 5 AS FRI
FROM DUAL;

Output:

Query for Previous Friday (subtract 2 from the Sunday instead of adding 5, again assuming Sunday is the first day of the week):

SELECT TRUNC(TO_DATE('20140705','YYYYMMDD') + 1, 'd') - 2 AS SAT,
       TRUNC(TO_DATE('20140706','YYYYMMDD') + 1, 'd') - 2 AS SUN,
       TRUNC(TO_DATE('20140707','YYYYMMDD') + 1, 'd') - 2 AS MON,
       TRUNC(TO_DATE('20140708','YYYYMMDD') + 1, 'd') - 2 AS TUE,
       TRUNC(TO_DATE('20140709','YYYYMMDD') + 1, 'd') - 2 AS WED,
       TRUNC(TO_DATE('20140710','YYYYMMDD') + 1, 'd') - 2 AS THU,
       TRUNC(TO_DATE('20140711','YYYYMMDD') + 1, 'd') - 2 AS FRI
FROM DUAL;

Output: