I am new to Map-reduce and I want to understand what is sequence file data input? I studied in the Hadoop book but it was hard for me to understand.
In computing, as well as in non-computing contexts, a file sequence is a well-ordered, (finite) collection of files, usually related to each other in some way.
Sequence files are in the binary format which can be split and the main use of these files is to club two or more smaller files and make them as a one sequence file. In Hive we can create a sequence file by specifying STORED AS SEQUENCEFILE in the end of a CREATE TABLE statement.
Sequence files are binary files containing serialized key/value pairs. You can compress a sequence file at the record (key-value pair) or block levels. This is one of the advantage of using sequence file. Also, sequebce files are binary files, they provide faster read/write than that of text file format.
First we should understand what problems does the SequenceFile try to solve, and then how can SequenceFile help to solve the problems.
Map tasks usually process a block of input at a time (using the default FileInputFormat).
The more the number of files is, the more number of Map task need and the job time can be much more slower.
These two cases require different solutions.
HAR files
SequenceFile
For example, suppose there are 10,000 100KB files, then we can write a program to put them into a single SequenceFile like below, where you can use filename to be the key and content to be the value.
(source: csdn.net)
Some benefits:
Supported Compressions, the file structure depends on the compression type.
Record-Compressed: Compresses each record as it’s added to the file.
(source: csdn.net)
Block-Compressed
(source: csdn.net)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With