I am trying to understand data locality as it relates to Hadoop's Map/Reduce framework. In particular I am trying to understand what component handles data locality (i.e. is it the input format?)
Yahoo's Developer Network Page states "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that the HDFS input format will perhaps query the name node to determine which nodes contain the desired data and will start the map tasks on those nodes if possible. One could imagine a similar approach with HBase: query to determine which region servers are serving the desired records.
If a developer writes their own input format would they be responsible for implementing data locality?
InputFormat describes the input specification for a Map-Reduce job. The Map-Reduce framework relies on the InputFormat of the job to validate the input specification and to split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
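Concretely, the newer org.apache.hadoop.mapreduce.InputFormat contract boils down to two abstract methods. The sketch below is trimmed to the signatures, so treat it as a paraphrase of the real class rather than a verbatim copy:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {

  // Validate the job's input specification and split it into logical
  // InputSplits; each split is later handed to one Mapper.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Build the RecordReader that turns a single split into <key, value> records.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}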
TextInputFormat is the default InputFormat of MapReduce. It treats each line of each input file as a separate record.
The Map and Reduce phases run one after the other. The Map function takes its input from disk as <key,value> pairs, processes them, and produces a set of intermediate <key,value> pairs as output. The Reduce function likewise takes <key,value> pairs as input and produces <key,value> pairs as output.
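A minimal word-count sketch of that flow (class names here are placeholders, not anything from the question): the Mapper emits intermediate <word, 1> pairs and the Reducer sums them.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (byte offset, line of text) -> intermediate (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}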
No, it is not mandatory to set the input and output type/format in MapReduce. By default, the framework treats both input and output as text: TextInputFormat on the way in and TextOutputFormat on the way out.
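For example, a driver can spell those choices out explicitly. In the sketch below (paths and class names are placeholders, reusing the Mapper/Reducer sketched above), the two set*FormatClass calls could be dropped entirely, because TextInputFormat and TextOutputFormat are already the defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Optional: these are already the defaults.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}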
You're right. If you look at the FileInputFormat class and its getSplits() method, you'll see that it looks up the block locations:
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
That call is the FileSystem query. It happens inside the JobClient, and the results are written into a SequenceFile (actually just raw bytes). The JobTracker later reads this file while initializing the job and essentially just assigns each task to an InputSplit.
BUT the distribution of the data itself is the NameNode's job.
To your question now:
Normally you extend FileInputFormat. You are then required to return a list of InputSplits, and when constructing each split you set the locations where its data can be found. For example, the FileSplit constructor:
public FileSplit(Path file, long start, long length, String[] hosts)
So you don't actually implement data locality yourself; you just tell the framework on which hosts each split can be found. That information is easily queryable through the FileSystem interface.
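To make that concrete, here is a rough sketch (the class name and the one-split-per-block policy are made up for illustration, not production code) of a custom FileInputFormat whose getSplits() asks the FileSystem for the block locations and passes the hosts straight into each FileSplit. This is essentially what the stock FileInputFormat already does for you:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical format: one split per HDFS block, hosts taken from the NameNode
// via FileSystem#getFileBlockLocations. The scheduler uses those hosts to place
// map tasks close to the data.
public class BlockPerSplitInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : listStatus(job)) {  // files under the configured input paths
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, file.getLen());
      for (BlockLocation block : blkLocations) {
        // Tell the framework where this chunk of data physically lives.
        splits.add(new FileSplit(path, block.getOffset(), block.getLength(),
            block.getHosts()));
      }
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) {
    return new LineRecordReader(); // reuse the stock line reader for simplicity
  }
}

If you leave the host array empty the job still runs; the scheduler just has no locality hint and may place the map task on any node.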