I am new to MapReduce and want to understand what sequence file data input is. I read about it in the Hadoop book, but it was hard for me to understand.

What is sequence file in hadoop?

1 Answer

First we should understand what problems SequenceFile tries to solve, and then how it helps to solve them.

In HDFS

SequenceFile is one of the solutions to the small-files problem in Hadoop.
A small file is one that is significantly smaller than the HDFS block size (128MB).
Each file, directory, and block in HDFS is represented as an object in the NameNode's memory, and each object occupies about 150 bytes.
10 million files, each using one block, would use about 3 gigabytes of NameNode memory.
A billion files is not feasible.
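
To make the memory figure concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the 150-bytes-per-object rule of thumb mentioned above and one block per small file; `namenode_bytes` is a hypothetical helper for illustration, not a Hadoop API.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb size of one NameNode object (assumption from the text)

def namenode_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

print(namenode_bytes(10_000_000) / 1e9)     # 3.0   (gigabytes for 10 million files)
print(namenode_bytes(1_000_000_000) / 1e9)  # 300.0 (gigabytes for a billion files)
```

The estimate ignores directories and any per-object overhead beyond the 150 bytes, so treat it as an order-of-magnitude guide only.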

In MapReduce

Map tasks usually process a block of input at a time (using the default FileInputFormat).
The more files there are, the more map tasks are needed, and the job can become much slower.
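
As a rough illustration of why the file count matters, the sketch below estimates the number of map tasks as one task per block per file. This is a simplification (it ignores configurable split sizes and other details of how input splits are really computed), and `map_tasks` is a hypothetical helper, not a Hadoop function.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS block size in bytes (128MB, as above)

def map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Approximate map task count: every file needs at least one task,
    and a large file needs one task per block."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

small_files = [100 * 1024] * 10_000   # 10,000 files of 100KB each
one_big_file = [10_000 * 100 * 1024]  # the same data in a single file

print(map_tasks(small_files))   # 10000
print(map_tasks(one_big_file))  # 8
```

The same amount of data needs 10,000 tasks as small files but only 8 as one large file, and each task carries its own startup overhead.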

Small file scenarios

The files are pieces of a larger logical file.
The files are inherently small, for example, images.

These two cases require different solutions.

For the first case, write a program to concatenate the small files together (see Nathan Marz’s post about a tool called the Consolidator, which does exactly this).
For the second case, some kind of container is needed to group the files in some way.
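
For the first case, the concatenation idea can be sketched as follows. This is a minimal local-filesystem example in Python, not the Consolidator tool and not HDFS code; the `part-*` file names and the `concatenate` helper are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path

def concatenate(parts, destination):
    """Append each part file to destination, in the order given."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

# Create three small "pieces of a larger logical file" in a temp directory.
workdir = Path(tempfile.mkdtemp())
for i, text in enumerate([b"line 1\n", b"line 2\n", b"line 3\n"]):
    (workdir / f"part-{i:05d}").write_bytes(text)

# Merge them in name order into one larger file.
merged = workdir / "merged.log"
concatenate(sorted(workdir.glob("part-*")), merged)
```

This only works when the pieces have no meaning on their own; files such as images need a container that keeps track of where each one starts and ends.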

Solutions in Hadoop

HAR files

HAR (Hadoop Archives) files were introduced to alleviate the problem of many files putting pressure on the NameNode’s memory.
HARs are probably best used purely for archival purposes.

SequenceFile

The idea of SequenceFile is to pack many small files into a single larger file.
For example, suppose there are 10,000 100KB files. We can write a program to put them into a single SequenceFile like the one below, using the filename as the key and the file content as the value.

(Figure: SequenceFile file layout; source: csdn.net)
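
The key/value idea can be sketched with a toy container in Python. This is only an illustration of packing filename/content pairs into one stream: it is not the real SequenceFile on-disk format (which also has a header and sync markers), and in practice you would write the file through Hadoop's Java `SequenceFile.Writer`. The helper names and sample data are hypothetical.

```python
import io
import struct

def write_records(stream, records):
    """Append (key, value) byte pairs as length-prefixed records."""
    for key, value in records:
        stream.write(struct.pack(">II", len(key), len(value)))  # two 4-byte lengths
        stream.write(key)
        stream.write(value)

def read_records(stream):
    """Yield (key, value) pairs back in the order they were written."""
    while True:
        header = stream.read(8)
        if len(header) < 8:  # end of stream
            return
        key_len, value_len = struct.unpack(">II", header)
        yield stream.read(key_len), stream.read(value_len)

# Filename is the key, file content is the value.
small_files = {"a.png": b"\x89PNG-first-image", "b.png": b"\x89PNG-second-image"}

container = io.BytesIO()
write_records(container, ((name.encode(), data) for name, data in small_files.items()))

container.seek(0)
restored = {key.decode(): value for key, value in read_records(container)}
```

All the small files now live in one stream, so the filesystem only has to track a single large file instead of thousands of small ones.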
Some benefits:
1. Less memory is needed on the NameNode. Continuing with the 10,000 100KB files example (these figures correspond to three 150-byte objects per block):
  - Before using SequenceFile, the 10,000 files occupy about 4.5MB of RAM in the NameNode.
  - After using SequenceFile, a single 1GB SequenceFile with 8 HDFS blocks occupies about 3.6KB of RAM in the NameNode.
2. SequenceFile is splittable, so it is suitable for MapReduce.
3. SequenceFile supports compression.
Supported compression types (the file structure depends on the compression type):
1. Uncompressed
2. Record-compressed: compresses each record as it’s added to the file.
  (Figure: record-compressed SequenceFile layout; source: csdn.net)
3. Block-compressed
  (Figure: block-compressed SequenceFile layout; source: csdn.net)
  - Waits until the buffered data reaches the block size, then compresses it all at once.
  - Block compression provides a better compression ratio than record compression.
  - Block compression is generally the preferred option when using SequenceFile.
  - "Block" here is unrelated to the HDFS or filesystem block.
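
To see why compressing many records together usually gives a better ratio than compressing each record alone, here is a small Python sketch using `zlib`. It only illustrates the principle: the sample records are invented, and batching by a fixed record count (`BATCH`) is a simplification of how SequenceFile fills a compression block.

```python
import zlib

# 1,000 short, similar records (made-up sample data).
records = [f"user={i % 10} action=click page=/home".encode() for i in range(1000)]
raw_size = sum(len(r) for r in records)

# Record compression: each record is compressed on its own.
record_compressed = sum(len(zlib.compress(r)) for r in records)

# Block compression: records are buffered and compressed in batches.
BATCH = 100  # records per batch (illustrative assumption)
block_compressed = sum(
    len(zlib.compress(b"".join(records[i:i + BATCH])))
    for i in range(0, len(records), BATCH)
)

print(raw_size, record_compressed, block_compressed)
```

Short records compressed one at a time give the compressor almost no repetition to exploit, while a batch lets it reuse patterns shared across records.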

answered Sep 28 '22 01:09

JiaMing Lin

Soghra Gargari
