
Do mappers store their intermediate output in the RAM of the datanode they run on?

Is my understanding correct that the job tracker launches a task (mapper/reducer) on the datanode where the input split is stored, runs that task on that piece of data, and the mapper stores its intermediate output in its local storage?

So my question is: since the mapper runs on a datanode, does it store its intermediate data in that datanode's RAM? The datanode's disk is part of HDFS, and intermediate output is not stored on HDFS.

user2017 asked Aug 14 '16 22:08



2 Answers

The output of the Mapper (intermediate data) is stored on the local file system (not HDFS) of the node on which each map task runs. This is typically a temporary directory that the Hadoop administrator can set in the configuration. Once the job has completed and the data has been transferred to the reducers, this intermediate data is cleaned up and is no longer accessible.

Kris answered Oct 16 '22 20:10



A map task initially writes its output to an in-memory buffer on the node where it runs.

Once the buffer fills to 80% of its capacity (the default threshold), the task starts spilling its contents to the local disk of that same node (not HDFS). In Hadoop 2.0, this disk location can be viewed or changed in mapred-site.xml under the property name:

mapreduce.cluster.local.dir
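
As a rough sketch of how this might look in mapred-site.xml: the fragment below sets the local directory together with the two standard properties that control the in-memory buffer size and the spill threshold. The directory paths are hypothetical placeholders, and the other values are simply the usual defaults, not tuning advice.

```xml
<!-- mapred-site.xml (illustrative fragment, assumed values) -->
<configuration>
  <!-- Local, non-HDFS directories for intermediate map output.
       The paths below are hypothetical; use your own local disks. -->
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/data/1/mapred/local,/data/2/mapred/local</value>
  </property>

  <!-- Size of the in-memory sort buffer per map task, in MB (default 100). -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>100</value>
  </property>

  <!-- Fraction of the buffer that may fill before spilling to disk (default 0.80). -->
  <property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.80</value>
  </property>
</configuration>
```

With these defaults, a map task would begin spilling to one of the local directories once roughly 80 MB of the 100 MB buffer is in use.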
Sumeet Gupta answered Oct 16 '22 20:10
