 

Splitting SequenceFile in controlled manner - Hadoop

Tags:

hadoop

Hadoop writes a SequenceFile in key-value pair (record) format. Suppose we have a large, unbounded log file. Hadoop will split the file based on block size and store the blocks on multiple data nodes. Is it guaranteed that each key-value pair resides within a single block, or could we end up with a case where the key is in one block on node 1 and the value (or part of it) is in a second block on node 2? If such meaningless splits are possible, what is the solution? Sync markers?

Another question: does Hadoop write sync markers automatically, or should we write them manually?

Asked Dec 06 '11 by Majid Azimi


1 Answer

I asked this question on the Hadoop mailing list. They answered:

Sync markers are already written into sequence files; they are part of the format. This is nothing to worry about, and it is simple enough to test and be confident about. The mechanism is the same as reading a text file with newlines: the reader will read past the boundary in order to complete a record if it has to.
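As a quick illustration, here is a minimal sketch of writing a SequenceFile with the classic SequenceFile.Writer API (the output path and the LongWritable/Text record types are just assumptions for this example). Note that the application only appends records; the writer emits sync markers into the stream on its own as part of the format:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/logs/events.seq"); // hypothetical output path

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, LongWritable.class, Text.class);
        try {
            // Append key-value records; the writer periodically inserts
            // sync markers between records as part of the SequenceFile format.
            for (long i = 0; i < 1000; i++) {
                writer.append(new LongWritable(i), new Text("log line " + i));
            }
        } finally {
            writer.close();
        }
    }
}
```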

Then I asked:

So if a map task analyses only the second block of the log file, it should not need to transfer any other parts of the file from other nodes, because that block is a standalone, meaningful split? Am I right?

They answered:

Yes. Simply put, your records shall never break. We do not read just at the split boundaries; we may extend beyond them until a sync marker is encountered in order to complete a record or series of records. The subsequent mappers will always skip ahead to their first sync marker and then begin reading, to avoid duplication. This is exactly how text file reading works as well, except that there the delimiter is newlines.
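A rough sketch of what that split-aware reading looks like, assuming a split described by a start and end byte offset (splitStart and splitEnd are placeholders here). The reader first seeks to the first sync marker at or after the split start, so it never begins in the middle of a record, and then reads complete records until it passes the split end; Hadoop's own SequenceFile record reader follows essentially this pattern:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileSplitReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/logs/events.seq"); // hypothetical input path

        long splitStart = 64L * 1024 * 1024;  // placeholder split boundaries
        long splitEnd = 128L * 1024 * 1024;

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // Skip forward to the first sync marker at or after splitStart,
            // so reading never starts in the middle of a record.
            reader.sync(splitStart);

            LongWritable key = new LongWritable();
            Text value = new Text();
            // Read complete records until the position passes splitEnd.
            // (The real record reader additionally continues to the next
            // sync marker so no boundary record is lost or duplicated.)
            while (reader.getPosition() < splitEnd && reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}
```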

Answered Oct 06 '22 by Majid Azimi