I want to process a lot of files in Hadoop -- each file has some header information, followed by a lot of records, each stored in a fixed number of bytes. Any suggestions on how to handle this?
I think the best solution is to write a custom InputFormat.
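To illustrate what that could look like, here is a minimal sketch (not tested production code) of a FileInputFormat/RecordReader pair that skips a fixed-size header at the start of each file and then hands every fixed-size record to the mapper as a BytesWritable keyed by its byte offset. The class name FixedRecordInputFormat and the two configuration keys are made up for the example; adjust them, and the header/record sizes, to your data.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedRecordInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    // Hypothetical configuration keys for this sketch.
    public static final String HEADER_LENGTH = "fixedrecord.header.length";
    public static final String RECORD_LENGTH = "fixedrecord.record.length";

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Keep it simple: one split per file, so the header is skipped exactly once.
        // Making files splittable needs extra logic to align splits on record boundaries.
        return false;
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new FixedRecordReader();
    }

    public static class FixedRecordReader extends RecordReader<LongWritable, BytesWritable> {
        private FSDataInputStream in;
        private long start, pos, end;
        private int recordLength;
        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration conf = context.getConfiguration();
            int headerLength = conf.getInt(HEADER_LENGTH, 0);
            recordLength = conf.getInt(RECORD_LENGTH, -1);

            FileSystem fs = split.getPath().getFileSystem(conf);
            in = fs.open(split.getPath());

            // Skip the per-file header, then read fixed-size records until the end of the split.
            start = split.getStart() + headerLength;
            end = split.getStart() + split.getLength();
            in.seek(start);
            pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (pos + recordLength > end) {
                return false;                      // no complete record left
            }
            byte[] record = new byte[recordLength];
            in.readFully(record);                  // exactly one fixed-size record
            key.set(pos);                          // key = byte offset of the record in the file
            value.set(record, 0, recordLength);
            pos += recordLength;
            return true;
        }

        @Override public LongWritable getCurrentKey()    { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() {
            return end == start ? 1.0f : (pos - start) / (float) (end - start);
        }
        @Override public void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }
}

In the driver you would then set job.setInputFormatClass(FixedRecordInputFormat.class) and provide the two lengths via conf.setInt(...) before submitting the job.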
One alternative is to check the byte offset of the line the mapper is reading (the LongWritable key). It is zero only for the first line of the file, so you can add a check to your map method as follows:
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (key.get() > 0) {
        // your mapper code -- runs for every line except the first one
    }
}
This skips the first line of each file.
However, it is not a good approach, because the condition is evaluated for every single line in the file.
The best way is still to write your own custom InputFormat.
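As a side note: newer Hadoop releases ship org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat, which already reads fixed-size binary records as BytesWritable values keyed by byte offset. It does not skip a per-file header, so you would still need a custom reader like the sketch in the first answer (or strip the headers beforehand), but for header-less fixed-length files a driver can be as small as the following sketch; the 64-byte record length is only a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FixedLengthDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FixedLengthInputFormat.setRecordLength(conf, 64);    // placeholder record size

        Job job = Job.getInstance(conf, "fixed-length-records");
        job.setJarByClass(FixedLengthDriver.class);
        job.setInputFormatClass(FixedLengthInputFormat.class);
        // Plug in your own Mapper<LongWritable, BytesWritable, ...> here;
        // without one, the identity mapper just passes the records through.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setNumReduceTasks(0);                            // map-only job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}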