Is the input format responsible for implementing data locality in Hadoop's MapReduce?

I am trying to understand data locality as it relates to Hadoop's MapReduce framework. In particular, I am trying to understand which component handles data locality (i.e., is it the input format?).

Yahoo's Developer Network Page states "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that the HDFS input format will perhaps query the name node to determine which nodes contain the desired data and will start the map tasks on those nodes if possible. One could imagine a similar approach could be taken with HBase by querying to determine which regions are serving certain records.

If a developer writes their own input format, would they be responsible for implementing data locality?

asked May 25 '11 by jmdev

People also ask

What is the input format in MapReduce?

InputFormat describes the input specification for a MapReduce job. The MapReduce framework relies on the InputFormat of the job to: validate the input specification of the job, and split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
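In the newer org.apache.hadoop.mapreduce API those two responsibilities map onto two abstract methods, roughly as in the following sketch (the class name here is made up; it only re-states the methods every implementation has to provide):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Skeleton of a custom InputFormat: the two methods every implementation provides.
public abstract class MyInputFormat<K, V> extends InputFormat<K, V> {

    // Validate the input specification and split it into logical InputSplits,
    // one per map task.
    @Override
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Turn a single split into <key, value> records for the Mapper it was assigned to.
    @Override
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}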

Is there any default input format for input files in MapReduce process?

TextInputFormat is the default InputFormat of MapReduce. It treats each line of each input file as a separate record.

What is the input to a MapReduce program reduce function?

They are sequenced one after the other. The Map function takes input from the disk as <key,value> pairs, processes them, and produces another set of intermediate <key,value> pairs as output. The Reduce function also takes inputs as <key,value> pairs, and produces <key,value> pairs as output.
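As a rough illustration of those signatures, here is the classic word-count pattern sketched with the newer mapreduce API (class and member names are my own):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: consumes <byte offset, line> pairs and emits intermediate <word, 1> pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: consumes <word, [1, 1, ...]> and emits <word, total count>.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}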

Is it mandatory to set input and output type format in Hadoop MapReduce?

No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as 'text'.
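A minimal driver sketch with the newer mapreduce API (class and job names invented); the two format calls are included only to make the text-based defaults explicit, and leaving them out gives the same behavior:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatDefaultsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format-defaults");
        job.setJarByClass(FormatDefaultsDriver.class);

        // Optional: these are already the defaults.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}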


1 Answer

You're right. If you look at the FileInputFormat class and its getSplits() method, you'll see that it looks up the block locations:

BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);

That is the FileSystem query. It happens inside the JobClient, and the results are written into a SequenceFile (actually just raw bytes). The JobTracker later reads this file while initializing the job, and from then on it is pretty much just assigning each task to an InputSplit.

BUT the distribution of the data itself is the NameNode's job.

Now to your question: normally you extend FileInputFormat, so you are forced to return a list of InputSplits, and when constructing each split you have to set its locations. For example, the FileSplit constructor:

public FileSplit(Path file, long start, long length, String[] hosts)

So you don't actually implement data locality yourself; you just tell the framework on which hosts each split can be found, and that information is easy to query through the FileSystem interface.
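To make that concrete, here is a hedged sketch (newer mapreduce API, class name invented) of an input format that creates one split per HDFS block and simply copies the block's host list into the FileSplit; the scheduler then does the actual locality-aware placement:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// One split per HDFS block; the block's host list is the only "locality"
// information this class provides.
public class BlockPerSplitInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus file : listStatus(job)) {
            Path path = file.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            // Ask the NameNode (via the FileSystem interface) where each block lives.
            BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
            for (BlockLocation block : blocks) {
                // Pass the hosts on to the split; the scheduler does the rest.
                splits.add(new FileSplit(path, block.getOffset(), block.getLength(),
                        block.getHosts()));
            }
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new LineRecordReader();
    }
}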

answered Sep 19 '22 by Thomas Jungblut