Is my understanding correct that the JobTracker launches a task (mapper/reducer) on the datanode where the input split is stored, runs that task against that piece of data, and the mapper stores its intermediate output in local storage?

So my question is: since the mapper runs on a datanode, does it store its intermediate data in the datanode's RAM? And since the datanode's disk is part of HDFS, how can the intermediate output not be stored on HDFS?
The output of the mapper (intermediate data) is stored on the local file system (not HDFS) of each mapper's data node. This is typically a temporary directory, which the Hadoop administrator can set in the configuration. Once the map task completes and the data has been transferred to the reducers, this intermediate data is cleaned up and is no longer accessible.
Each map task initially stores its output in an in-memory buffer on the datanode.

Once the buffer fills to 80% of its capacity (by default), the task starts spilling to the local disk of the datanode itself (not HDFS). In Hadoop 2.0, this disk location can be viewed or modified in mapred-site.xml under the property
mapreduce.cluster.local.dir
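For illustration, a minimal mapred-site.xml fragment setting this property might look like the following (the paths are hypothetical examples; listing several comma-separated directories spreads spill I/O across disks):

```xml
<configuration>
  <!-- Local (non-HDFS) directories where map tasks spill intermediate data.
       Multiple comma-separated paths spread the spill load across disks. -->
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>
</configuration>
```

The related buffer settings (the in-memory sort buffer size and the spill threshold) are also configurable, so the "80% of the buffer" behavior described above is the default rather than a fixed rule.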