I understand that in Hadoop, a large input file is split into smaller chunks that are processed on different nodes by the map functions. I also learned that the InputSplits can be customized. What I would like to know is whether the following kind of customization of the InputSplit is possible:
I have a large input file coming into Hadoop, and I want a subset of the file, i.e. a fixed set of lines, to accompany every input split. In other words, every data chunk of the large file should contain this set of lines, no matter how the file is split.
To make my question clearer: suppose we need to compare one part of the input file (say part A) with the rest of the file's content. In that case, every InputSplit that goes to the map function needs to carry part A with it for the comparison.
Kindly guide me on this.
Theoretically it would be possible to split your big file (A, B, C, D, ...) into splits (A, B), (A, C), (A, D), .... However, you would have to write a lot of custom classes for this purpose. Currently FileSplit, which extends InputSplit, basically says that the split for a given file begins at position start and has a fixed length. The actual access to the file is done by a RecordReader, e.g. LineRecordReader. So you would have to implement code that reads not only the actual split, but the header (part A) of the file as well.
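To illustrate the idea, here is a minimal sketch in plain Java that simulates such a header-aware reader: every split yields the shared header lines (part A) first, followed by the split's own lines. The class and method names are illustrative only, and plain list indexing stands in for Hadoop's byte-offset FileSplit/LineRecordReader machinery; a real implementation would subclass RecordReader and re-open the file at offset 0 to read the header.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a "record reader" that prepends a fixed header
// (part A, assumed to be the first headerLines lines of the file) to
// the records of every split. Not Hadoop API.
class HeaderAwareSplitReader {

    // Returns the records a mapper would see for one split:
    // the shared header lines followed by the split's own lines.
    static List<String> readSplit(List<String> allLines,
                                  int headerLines,
                                  int splitStart, int splitLength) {
        List<String> records = new ArrayList<>(allLines.subList(0, headerLines));
        int end = Math.min(splitStart + splitLength, allLines.size());
        // If the split itself covers the header, skip those lines
        // so they are not emitted twice.
        int from = Math.max(splitStart, headerLines);
        if (from < end) {
            records.addAll(allLines.subList(from, end));
        }
        return records;
    }
}
```

For a file with header lines A1, A2 and body B1, B2, C1, C2, the split covering (C1, C2) would yield A1, A2, C1, C2, which is exactly the "(A, C)" split described above.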
I'd argue that the approach you're looking for is impractical. The reason the record reader accesses only the byte range (start, start + length) is data locality. For a very big file, parts A and Z would live on two different nodes.
Depending on the size of part A, a better idea would be to store this common part in the DistributedCache. That way you can access the common data efficiently from each of the mappers. Refer to the javadoc and http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata for further information.