MapReduce

Sep 29

Written By Rema Shivakumar- CuriouSTEM Staff

MapReduce is an important component of Hadoop responsible for processing data stored in HDFS. MapReduce, a programming model designed to work on a distributed platform allowing for parallel processing of large chunks of data. The MapReduce includes the map and reduce components. Map phase splits data and performs the procedure that is included in the mapper program given by the client. The output of mapper goes to the reducing phase, which shuffles and summarizes the information as required by the client. We can see here that MapReduce is the processing layer.

The map and reduce programs are specified along with input files and output directory. The input and output of both phases are key and value pairs that are computed based on client requirements. YARN assigns Application master that handles containers running in various nodes that carry out the tasks parallely. The order of tasks is determined by the scheduler. After data arrives for processing, it is split into fixed size blocks. The Map phase takes the blocks and runs the map function on each record of data. The output is stored in the local disk as intermediate output. If the client has specified a reduce program, the data is read into the reduce phase after sorting it. The reduce phase performs some kind of aggregation and generates output stored in HDFS.

MapReduce has multiple advantages:

It can be scaled to handle Big Data of petabytes of data.

It is quick as it processes data parallelly in multiple data nodes.

Clients can specify their programs in a language of their choice such as python, C++ etc.

Doesn’t need special hardware , can run on commodity hardware

TechnologyComputer Science

Rema Shivakumar- CuriouSTEM Staff

CuriouSTEM Content Director - Computer Science

MapReduce

Yet Another Resource Navigator

Hadoop Distributed File System