How Hadoop works

As we have seen in the previous two articles on Hadoop, Hadoop stores large amounts of data in a distributed fashion across a cluster of commodity hardware. The client submits data along with the programs to run on it. HDFS stores the files as fixed-size blocks, MapReduce handles the processing by running those programs, and YARN is responsible for allocating resources and dividing the work into tasks.

HDFS consists of two kinds of nodes: the NameNode and the DataNodes. The NameNode is the master daemon; it manages the DataNodes and holds the filesystem metadata in the form of edit logs and an fsimage file. The DataNodes are responsible for storing the actual data, and each one sends a heartbeat signal to the NameNode every 3 seconds to show that it is alive. Files are divided into blocks that are distributed across different DataNodes, and every block is replicated so that if a DataNode fails, no data is lost.
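To make this concrete, here is a minimal sketch of how a client might write a file to HDFS using Hadoop's Java FileSystem API. The NameNode address, file path, block size, and replication factor are assumptions for illustration; in a real cluster these usually come from the cluster-wide configuration files rather than being set in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; a real deployment would normally
        // pick this up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // Files are split into blocks of this size (128 MB here), and each
        // block is replicated to 3 DataNodes so a single failure loses no data.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // The client asks the NameNode for metadata (which DataNodes hold
        // each block), then streams the bytes directly to those DataNodes.
        try (FSDataOutputStream out = fs.create(new Path("/data/input.txt"))) {
            out.writeUTF("Hello, Hadoop!");
        }
        fs.close();
    }
}
```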

The MapReduce layer takes the processing requirements and divides them into independent tasks that can run in parallel across the DataNodes. The processing is split into a map phase and a reduce phase: the map function applies the user's business logic to each input record and emits intermediate key-value pairs, while the reduce function summarizes or aggregates the values collected for each key. YARN is the resource management layer that handles the assignment of tasks and resources. It consists of a ResourceManager and NodeManagers. The ResourceManager has two components, the Scheduler and the ApplicationsManager. Each NodeManager launches containers, and one container per job runs a special component called the ApplicationMaster, which manages the tasks running in the other containers allocated by the ResourceManager. After the processing is complete, the output is written back to HDFS.
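The classic word-count job illustrates this split between mapping and reducing, using Hadoop's standard MapReduce Java API. The input and output paths below are made up for illustration; in practice they point at directories in HDFS.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: the business logic runs on each input record in parallel
    // across the cluster, emitting (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: summarize/aggregate all the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output live in HDFS; these paths are illustrative.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

When the job is submitted, YARN's ResourceManager allocates containers for it, an ApplicationMaster coordinates the map and reduce tasks inside those containers, and the final counts are written back to HDFS.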


Rema Shivakumar - CuriouSTEM Staff

CuriouSTEM Content Director - Computer Science
