We know that Hadoop uses the data locality principle when scheduling map tasks in order to save network bandwidth. Here is a description of how this works:
Taken from: http://hadoop-gyan.blogspot.in/
Hadoop tries its best to run map tasks on nodes where the data is present locally, to minimize network transfer and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to a given map task available on a single node. Since HDFS only guarantees that data of up to one block size (64 MB) is stored on a single node, it is advised/advocated to set the split size equal to the HDFS block size so that the map task can take advantage of this data localization.
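To make the mechanism concrete, here is a minimal sketch (paths and file names are illustrative, not from the original post) of how block locations can be queried through Hadoop's FileSystem API. This is the same information the MapReduce framework consumes via InputSplit.getLocations() when it tries to schedule a map task on a node holding a replica of the split's data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Resolves to HDFS if fs.defaultFS points at an HDFS namenode.
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/input/part-00000"));

        // One BlockLocation per block; each lists the datanodes holding a replica.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```

When the split size equals the block size, each split maps onto exactly one of these BlockLocation entries, so its data is guaranteed to live entirely on the hosts reported for that block.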
Hadoop is capable of running map-reduce jobs even if the underlying file system is not HDFS (i.e., it can run on other filesystems such as Amazon's S3). Now, how is data locality accounted for in that case? With HDFS, the namenode holds all the block location information, and the framework uses it to spawn mappers as close to the data as possible. However, other filesystems have no concept of a namenode. So how does the Hadoop MapReduce framework (JobTracker and TaskTracker) learn the location of the data in order to apply the data locality principle when running a job?
As you mentioned, filesystems like S3 have no namenode, and Hadoop does not need one to run against them. In that case the data locality optimization is not available: there is no block placement information to report, so map tasks simply read their input over the network from S3.
Reference: http://wiki.apache.org/hadoop/AmazonS3
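A short sketch of the contrast, based on the default FileSystem behaviour (the bucket name and key below are hypothetical): filesystems that do not expose block placement fall back to a single placeholder location, typically "localhost", so every split looks identical to the scheduler and no data-local task placement is possible.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3LocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Bucket and key are placeholders; credentials come from the usual config.
        FileSystem s3 = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
        FileStatus status = s3.getFileStatus(new Path("s3n://my-bucket/input/data.txt"));

        BlockLocation[] blocks = s3.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Typically prints a single placeholder host such as "localhost":
            // no real locality information is available to the scheduler.
            System.out.println(String.join(",", block.getHosts()));
        }
    }
}
```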