I am trying to understand data locality as it relates to Hadoop's Map/Reduce framework. In particular I am trying to understand what component handles data locality (i.e. is it the input format?)
Yahoo's Developer Network Page states "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that the HDFS input format will perhaps query the name node to determine which nodes contain the desired data and will start the map tasks on those nodes if possible. One could imagine a similar approach with HBase: query to determine which region servers are serving the desired records.
If a developer writes their own input format would they be responsible for implementing data locality?
InputFormat describes the input specification for a Map-Reduce job. The Map-Reduce framework relies on the InputFormat of the job to validate the input specification and to split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
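Concretely, the newer org.apache.hadoop.mapreduce.InputFormat contract boils down to two abstract methods. The sketch below is trimmed to the signatures, so treat it as a paraphrase of the real class rather than a verbatim copy:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {

  // Validate the job's input specification and split it into logical
  // InputSplits; each split is later handed to one Mapper.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Build the RecordReader that turns a single split into <key, value> records.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}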
TextInputFormat is the default InputFormat of MapReduce. It treats each line of each input file as a separate record.
The Map and Reduce phases run one after the other. The Map function takes its input from disk as <key,value> pairs, processes them, and produces a set of intermediate <key,value> pairs as output. The Reduce function likewise takes <key,value> pairs as input and produces <key,value> pairs as output.
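A minimal word-count sketch of that flow (class names here are placeholders, not anything from the question): the Mapper emits intermediate <word, 1> pairs and the Reducer sums them.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (byte offset, line of text) -> intermediate (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}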
No, it is not mandatory to set the input and output type/format in MapReduce. By default, the framework treats both input and output as text: TextInputFormat on the way in and TextOutputFormat on the way out.
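For example, a driver can spell those choices out explicitly. In the sketch below (paths and class names are placeholders, reusing the Mapper/Reducer sketched above), the two set*FormatClass calls could be dropped entirely, because TextInputFormat and TextOutputFormat are already the defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Optional: these are already the defaults.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}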
You're right. If you look at the FileInputFormat class and its getSplits() method, you'll see that it looks up the block locations:
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
That call is the FileSystem query. It happens inside the JobClient, and the results are written into a SequenceFile (actually just raw bytes). The JobTracker later reads this file while initializing the job and essentially just assigns each task to an InputSplit.
BUT the distribution of the data itself is the NameNode's job.
To your question now:
Normally you extend FileInputFormat. You are then required to return a list of InputSplits, and when constructing each split you set the locations where its data can be found. For example, the FileSplit constructor:
public FileSplit(Path file, long start, long length, String[] hosts)
So you don't actually implement data locality yourself; you just tell the framework on which hosts each split can be found. That information is easily queryable through the FileSystem interface.
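To make that concrete, here is a rough sketch (the class name and the one-split-per-block policy are made up for illustration, not production code) of a custom FileInputFormat whose getSplits() asks the FileSystem for the block locations and passes the hosts straight into each FileSplit. This is essentially what the stock FileInputFormat already does for you:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical format: one split per HDFS block, hosts taken from the NameNode
// via FileSystem#getFileBlockLocations. The scheduler uses those hosts to place
// map tasks close to the data.
public class BlockPerSplitInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : listStatus(job)) {  // files under the configured input paths
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, file.getLen());
      for (BlockLocation block : blkLocations) {
        // Tell the framework where this chunk of data physically lives.
        splits.add(new FileSplit(path, block.getOffset(), block.getLength(),
            block.getHosts()));
      }
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) {
    return new LineRecordReader(); // reuse the stock line reader for simplicity
  }
}

If you leave the host array empty the job still runs; the scheduler just has no locality hint and may place the map task on any node.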