
How does Hadoop split files without losing data integrity?

We all know that a large input file is divided into equal-size splits (64 MB by default). Let's say I have a .txt file that is 104 MB. In theory, this file is divided into 2 splits (one of 64 MB and one of 40 MB). Is it possible for a split boundary to fall in the middle of a word? For example, with the word "Hadoop", "Ha" would be the end of the first split and "doop" the beginning of the second. If this happens, how can we solve the WordCount problem correctly?

duong_dajgja asked Apr 27 '26 13:04



1 Answer

That logic is encapsulated in the InputFormat configured for the mapper. There are different subclasses of InputFormat, and you choose the one that matches the kind of file the Mapper consumes. For example, the TextInputFormat class breaks the input into lines at line feeds. A split boundary can fall in the middle of a line, so there may be a partial line at the beginning or end of a split, but the logic recognizes those situations and still returns each complete line to exactly one mapper.
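To make the boundary handling concrete, here is a minimal Python sketch of one common rule for line-oriented readers. It is a simulation for illustration, not Hadoop's actual code: the names `read_split`, `make_splits`, and `word_count` are hypothetical, and the rule itself is my understanding of how a line record reader typically behaves. A reader whose split does not start at byte 0 throws away everything up to and including the first line feed, and every reader keeps reading past the end of its split until it finishes the line it has started.

```python
from collections import Counter


def read_split(data, start, length):
    """Return the complete lines owned by the split [start, start + length).

    Hypothetical helper that mimics a line-oriented record reader.
    """
    end = start + length
    pos = start
    if start != 0:
        # The first (possibly partial) line belongs to the previous split,
        # so skip to just past the next line feed.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    # Read every line that starts at or before the split end, even if the
    # line itself runs past the end of the split.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        nxt = len(data) if nl == -1 else nl + 1
        lines.append(data[pos:nxt].rstrip(b"\n"))
        pos = nxt
    return lines


def make_splits(total, split_size):
    """Cut a file of `total` bytes into (start, length) pairs."""
    return [(s, min(split_size, total - s)) for s in range(0, total, split_size)]


def word_count(data, split_size):
    """Count words with one simulated mapper per split."""
    counts = Counter()
    for start, length in make_splits(len(data), split_size):
        for line in read_split(data, start, length):
            counts.update(line.decode().split())
    return counts


if __name__ == "__main__":
    text = b"I love Hadoop\nHadoop splits files\n"
    # A 9-byte split size puts the first boundary between "Ha" and "doop".
    print(word_count(text, 9))
    # For comparison: the count over the whole, unsplit text.
    print(Counter(text.decode().split()))
```

With a 9-byte split size the first boundary falls between "Ha" and "doop", yet the first simulated mapper still receives the whole line "I love Hadoop" and the second one skips the leftover "doop", so the two printed counts agree.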

Chris Gerken answered Apr 30 '26 07:04



