I want to process a lot of files in Hadoop -- each file has some header information, followed by a lot of records, each stored in a fixed number of bytes. Any suggestions on how to handle this?
I think the best solution is to write a custom InputFormat.
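To illustrate what that could look like, here is a minimal sketch (not tested production code) of a FileInputFormat/RecordReader pair that skips a fixed-size header at the start of each file and then hands every fixed-size record to the mapper as a BytesWritable keyed by its byte offset. The class name FixedRecordInputFormat and the two configuration keys are made up for the example; adjust them, and the header/record sizes, to your data.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedRecordInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    // Hypothetical configuration keys for this sketch.
    public static final String HEADER_LENGTH = "fixedrecord.header.length";
    public static final String RECORD_LENGTH = "fixedrecord.record.length";

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Keep it simple: one split per file, so the header is skipped exactly once.
        // Making files splittable needs extra logic to align splits on record boundaries.
        return false;
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new FixedRecordReader();
    }

    public static class FixedRecordReader extends RecordReader<LongWritable, BytesWritable> {
        private FSDataInputStream in;
        private long start, pos, end;
        private int recordLength;
        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration conf = context.getConfiguration();
            int headerLength = conf.getInt(HEADER_LENGTH, 0);
            recordLength = conf.getInt(RECORD_LENGTH, -1);

            FileSystem fs = split.getPath().getFileSystem(conf);
            in = fs.open(split.getPath());

            // Skip the per-file header, then read fixed-size records until the end of the split.
            start = split.getStart() + headerLength;
            end = split.getStart() + split.getLength();
            in.seek(start);
            pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (pos + recordLength > end) {
                return false;                      // no complete record left
            }
            byte[] record = new byte[recordLength];
            in.readFully(record);                  // exactly one fixed-size record
            key.set(pos);                          // key = byte offset of the record in the file
            value.set(record, 0, recordLength);
            pos += recordLength;
            return true;
        }

        @Override public LongWritable getCurrentKey()    { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() {
            return end == start ? 1.0f : (pos - start) / (float) (end - start);
        }
        @Override public void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }
}

In the driver you would then set job.setInputFormatClass(FixedRecordInputFormat.class) and provide the two lengths via conf.setInt(...) before submitting the job.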
One alternative is to check the byte offset of the line the mapper is reading (the LongWritable key). It is zero only for the first line of the file, so you can add a check to your map method as follows:
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (key.get() > 0) {
        // your mapper code -- runs for every line except the first one
    }
}
This skips the first line of each file.
However, it is not a good approach, because the condition is evaluated for every single line in the file.
The best way is still to write your own custom InputFormat.
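As a side note: newer Hadoop releases ship org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat, which already reads fixed-size binary records as BytesWritable values keyed by byte offset. It does not skip a per-file header, so you would still need a custom reader like the sketch in the first answer (or strip the headers beforehand), but for header-less fixed-length files a driver can be as small as the following sketch; the 64-byte record length is only a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FixedLengthDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FixedLengthInputFormat.setRecordLength(conf, 64);    // placeholder record size

        Job job = Job.getInstance(conf, "fixed-length-records");
        job.setJarByClass(FixedLengthDriver.class);
        job.setInputFormatClass(FixedLengthInputFormat.class);
        // Plug in your own Mapper<LongWritable, BytesWritable, ...> here;
        // without one, the identity mapper just passes the records through.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setNumReduceTasks(0);                            // map-only job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}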