I understand that in Hadoop, a large input file is split into smaller chunks that are processed on different nodes by the map functions. I also learned that the InputSplits can be customized. What I would like to know is whether the following kind of customization of the InputSplit is possible:
I have a large input file coming into Hadoop, and I want a subset of the file, i.e. a fixed set of lines, to accompany every input split. In other words, every data chunk of the large file should contain this set of lines, no matter how the file is split.
To make my question clearer: suppose we need to compare one part of the input file (say part A) with the rest of the file's content. In that case, every InputSplit that goes to the map function needs to carry part A with it for the comparison.
Kindly guide me on this.
Theoretically it would be possible to split your big file (A, B, C, D, ...) into splits (A, B), (A, C), (A, D), .... However, you would have to write a lot of custom classes for this purpose. Currently FileSplit, which extends InputSplit, basically says that the split for a given file begins at position start and has a fixed length. The actual access to the file is done by a RecordReader, e.g. LineRecordReader. So you would have to implement code that reads not only the actual split, but the header (part A) of the file as well.
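To illustrate the idea, here is a minimal sketch in plain Java that simulates such a header-aware reader: every split yields the shared header lines (part A) first, followed by the split's own lines. The class and method names are illustrative only, and plain list indexing stands in for Hadoop's byte-offset FileSplit/LineRecordReader machinery; a real implementation would subclass RecordReader and re-open the file at offset 0 to read the header.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a "record reader" that prepends a fixed header
// (part A, assumed to be the first headerLines lines of the file) to
// the records of every split. Not Hadoop API.
class HeaderAwareSplitReader {

    // Returns the records a mapper would see for one split:
    // the shared header lines followed by the split's own lines.
    static List<String> readSplit(List<String> allLines,
                                  int headerLines,
                                  int splitStart, int splitLength) {
        List<String> records = new ArrayList<>(allLines.subList(0, headerLines));
        int end = Math.min(splitStart + splitLength, allLines.size());
        // If the split itself covers the header, skip those lines
        // so they are not emitted twice.
        int from = Math.max(splitStart, headerLines);
        if (from < end) {
            records.addAll(allLines.subList(from, end));
        }
        return records;
    }
}
```

For a file with header lines A1, A2 and body B1, B2, C1, C2, the split covering (C1, C2) would yield A1, A2, C1, C2, which is exactly the "(A, C)" split described above.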
I'd argue that the approach you're looking for is impractical. The reason the record reader accesses only the byte range (start, start + length) is data locality. For a very big file, parts A and Z would live on two different nodes.
Depending on the size of part A, a better idea would be to store this common part in the DistributedCache. That way you can access the common data efficiently from each of the mappers. Refer to the javadoc and http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata for further information.