Thursday, February 11, 2016

Big Data & Hadoop Map Reduce

What is Big Data?

Big data, as the name suggests, refers to data with the following 3 characteristics:
1- High generation rate, i.e. high Velocity of data generation
2- Big data size, i.e. big Volume of data
3- Many sources for the data, i.e. Variety of data sources.
These 3 Vs, Velocity (speed), Volume (size) and Variety (sources), define what is called big data. Such data is usually semi-structured and comes from different sources.

Apache Hadoop:
Apache Hadoop is an open-source framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters.

So we are discussing 2 concepts in Hadoop here: distributed storage and distributed processing.

1) Distributed Storage: 
In Hadoop, data is stored in HDFS (Hadoop Distributed File System), where each file is split into blocks named blk_1, blk_2, etc. The default block size is 64MB in Hadoop 1.x (128MB in Hadoop 2.x and later).
Each block is replicated 3 times across Data Nodes for redundancy, which means that 100MB of data needs 300MB of storage. If a node fails, Hadoop automatically re-replicates its blocks onto other Data Nodes.
The Name Node keeps track of where the blocks of each file stored in HDFS are located.
The Name Node's metadata is usually also stored on NFS (Network File System) for higher reliability, and the Name Node can run in an active-standby model where the standby node takes over if the active node fails.
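The storage arithmetic above can be sketched in a few lines of Python. This is only an illustration using the numbers quoted in this post (64MB blocks, 3 replicas); the function name hdfs_footprint is made up for the example and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted in this post (assumption: Hadoop 1.x default)
REPLICATION = 3      # replicas per block quoted in this post


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, raw storage in MB) for a single file.

    Hypothetical helper for illustration only.
    """
    # A file is cut into fixed-size blocks; the last block may be partly filled.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A partly filled block only takes the space of the data it holds,
    # so raw storage is simply file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb


blocks, raw = hdfs_footprint(100)
print(blocks, raw)  # 2 300
```

So a 100MB file becomes 2 blocks and occupies 300MB across the cluster.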
2) Distributed Processing:
We have 2 concepts here for distributed data processing: the mappers and the reducers.
- Each Mapper works on a small portion of the data and produces intermediate records as (key, value) pairs.
- The data goes through the Shuffle step, which moves the intermediate records to the reducers and sorts them by key, so that all values for the same key reach the same reducer.
- Each Reducer receives the data in the format (key, values) and reduces it to the final records for that key.
- Finally, if we have multiple reducers, an additional step is required to merge and sort their outputs to produce the final required results.
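To make the map, shuffle and reduce steps concrete, here is a minimal word-count sketch in plain Python. It runs in a single process and only imitates the data flow; the function names (mapper, shuffle, reducer, run_job) are chosen for the example and are not Hadoop APIs.

```python
from collections import defaultdict


def mapper(line):
    """Emit one (word, 1) pair per word in the input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key and sort by key, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce all values of one key to a single count."""
    return key, sum(values)


def run_job(lines):
    # Map phase: every line is processed independently.
    intermediate = [pair for line in lines for pair in mapper(line)]
    # Shuffle phase groups by key, then the reduce phase runs once per key.
    return [reducer(key, values) for key, values in shuffle(intermediate)]


result = run_job(["big data", "big clusters", "data nodes"])
print(result)  # [('big', 2), ('clusters', 1), ('data', 2), ('nodes', 1)]
```

In a real cluster the mapper calls run on different nodes and the shuffle moves data over the network, but the shape of the records is the same.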

The Job Tracker is responsible for splitting the work across the different mappers and reducers.

On each Data Node there is a Task Tracker. A data block is usually processed by a mapper on the same node that stores it, to reduce network traffic. If the nodes holding all 3 copies of the block are busy, the task can be delegated to a mapper on a different Data Node and the block streamed to that node, but this happens rarely.

Combiners can do some of the reduction on the mapper side, to reduce the amount of data moved over the network.
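A rough way to see what a combiner saves is to count the records a mapper would send with and without one. The sketch below is plain Python, not Hadoop code, and the names map_words and combine are made up for the illustration.

```python
from collections import Counter


def map_words(lines):
    """Mapper output for one node: one (word, 1) pair per word."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Pre-sum the counts locally, before anything is sent to the reducers."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


mapper_output = map_words(["a b a", "a b"])
combined = combine(mapper_output)
print(len(mapper_output), len(combined))  # 5 2
print(combined)  # [('a', 3), ('b', 2)]
```

Here 5 intermediate records shrink to 2 before the shuffle. This only works because summing is associative and commutative, so the reducer gets the same totals either way.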

Hadoop Common Design Patterns:
1) Filtering Patterns
e.g. Sampling or Top-N list

Filter the records without changing them: keep some and discard the others.
The output is a subset of the original data set.

- Simple filter, e.g. a function that decides whether to keep each record
- Bloom Filter
- Sampling Filter
- Random Sampling
- Top-N filter.
  Each mapper finds its local top-N list and the reducer produces the final top-N from these lists.
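The Top-N filter described above can be sketched as follows. This is a single-process Python illustration with made-up function names (mapper_top_n, reducer_top_n), assuming the records are plain numbers and "top" means largest.

```python
import heapq


def mapper_top_n(records, n):
    """Each mapper keeps only the n largest records of its own split."""
    return heapq.nlargest(n, records)


def reducer_top_n(local_lists, n):
    """A single reducer merges the local lists and keeps the global top n."""
    merged = [record for local in local_lists for record in local]
    return heapq.nlargest(n, merged)


splits = [[5, 1, 9, 3], [7, 8, 2], [4, 10, 6]]
local_lists = [mapper_top_n(split, 2) for split in splits]
print(local_lists)                    # [[9, 5], [8, 7], [10, 6]]
print(reducer_top_n(local_lists, 2))  # [10, 9]
```

The reducer only sees n records per mapper instead of the whole data set, which is why this pattern is cheap even on very large inputs.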

2) Summarization Patterns
e.g. Counting, Min/Max, Statistics, Index, etc.
- Inverted Index (reverse index for faster search)
- Numerical Summarization
    e.g. count, min/max, first/last, mean, median, etc.
Combiners are important here, because they shrink the data before it is moved over the network.
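As a sketch of numerical summarization with a combiner, the example below computes min, max, count and mean. The assumption here is that each mapper emits a partial summary (min, max, sum, count) instead of raw values; the function names are made up for the illustration.

```python
from functools import reduce


def mapper(values):
    """Summarize one input split as (min, max, sum, count)."""
    return min(values), max(values), sum(values), len(values)


def merge(a, b):
    """Merge two partial summaries; the same logic can serve as combiner and reducer."""
    return min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3]


def finalize(summary):
    """Turn the merged summary into the final statistics."""
    lo, hi, total, count = summary
    return {"min": lo, "max": hi, "count": count, "mean": total / count}


splits = [[4, 8], [1, 9, 5], [3]]
partials = [mapper(split) for split in splits]
stats = finalize(reduce(merge, partials))
print(stats)  # {'min': 1, 'max': 9, 'count': 6, 'mean': 5.0}
```

Note that the mean is carried as (sum, count) and divided only at the end. Averaging the per-split means directly would give a wrong answer when the splits have different sizes. A median cannot be merged this way at all.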

3) Structural Patterns
e.g. Combining data sets.
Structured to Hierarchical Pattern
- Moves structured data from an RDBMS into Hadoop in a hierarchical form.
  The data must be linked in the source using foreign keys (FK) and must be structured in a row-based format.
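A small sketch of the structured to hierarchical idea: rows from two tables linked by a foreign key are keyed by that FK, and the reducer nests the child rows under their parent. The tables (posts, comments), the column names and the function names are all invented for this example.

```python
from collections import defaultdict

# Hypothetical row-based tables; comments.post_id is the FK to posts.post_id.
posts = [{"post_id": 1, "title": "Intro"}, {"post_id": 2, "title": "HDFS"}]
comments = [
    {"comment_id": 10, "post_id": 1, "text": "Nice"},
    {"comment_id": 11, "post_id": 2, "text": "Thanks"},
    {"comment_id": 12, "post_id": 1, "text": "More please"},
]


def mapper(table, row):
    """Key every row by the post id and tag it with its source table."""
    yield row["post_id"], (table, row)


def reducer(post_id, tagged_rows):
    """Build one hierarchical record: the post with its comments nested inside."""
    post, children = None, []
    for table, row in tagged_rows:
        if table == "post":
            post = dict(row)
        else:
            children.append({k: v for k, v in row.items() if k != "post_id"})
    if post is None:
        return None  # comments without a parent post are dropped
    post["comments"] = children
    return post


def run(posts, comments):
    groups = defaultdict(list)  # stands in for the shuffle step
    for table, rows in (("post", posts), ("comment", comments)):
        for row in rows:
            for key, value in mapper(table, row):
                groups[key].append(value)
    results = (reducer(key, groups[key]) for key in sorted(groups))
    return [record for record in results if record is not None]


for record in run(posts, comments):
    print(record)
```

Each output record is a post with a "comments" list inside it, which is the hierarchical form that can then be stored as JSON or XML.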

4) Others such as Organization, I/O, etc.

For More Details

This is just Big Data and Hadoop MapReduce in a nutshell, as covered in the Udacity course Intro to Hadoop and MapReduce.