 

Small files and HDFS blocks

Tags: hadoop, hdfs

Does a block in the Hadoop Distributed File System store multiple small files, or does a block store only one file?

asked Dec 19 '11 by Eugen


People also ask

What is the problem with small files in Hadoop?

If you're storing small files, then you probably have lots of them (otherwise you wouldn't turn to Hadoop), and the problem is that HDFS can't handle lots of files. Every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies 150 bytes, as a rule of thumb.
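For a rough sense of scale, here is a back-of-the-envelope sketch; the file count is hypothetical and the per-object figure is the 150-byte rule of thumb quoted above:

public class NameNodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 10_000_000L;      // hypothetical number of small files
        long objectsPerFile = 2;       // roughly 1 file object + 1 block object per small file
        long bytesPerObject = 150;     // rule-of-thumb figure quoted above
        long totalBytes = files * objectsPerFile * bytesPerObject;
        System.out.printf("~%.1f GB of NameNode heap%n", totalBytes / 1e9);  // ~3.0 GB
    }
}

Ten million small files already cost on the order of 3 GB of NameNode heap just for metadata, before any data is read at all.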

How does Hadoop handle small file size?

You can create a MapReduce program to convert lots of small files into a single SequenceFile. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They also support block compression, which is the best option.
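As an illustration, here is a minimal single-process sketch (not a full MapReduce job) that packs small files into one block-compressed SequenceFile; the paths are placeholders and error handling is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);                       // e.g. /user/me/packed.seq
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (int i = 1; i < args.length; i++) {         // remaining args: small files to pack
                Path in = new Path(args[i]);
                byte[] contents = new byte[(int) fs.getFileStatus(in).getLen()];
                try (FSDataInputStream stream = fs.open(in)) {
                    stream.readFully(0, contents);
                }
                // key = original file name, value = raw file contents
                writer.append(new Text(in.toString()), new BytesWritable(contents));
            }
        }
    }
}

The same append logic can sit inside a mapper or reducer when the conversion needs to run as a MapReduce job.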

What if file size is less than block size?

The default data block size of HDFS is 128 MB. When a file is significantly smaller than the block size, efficiency degrades.


1 Answer

Multiple files are not stored in a single block. BTW, a single file can be stored across multiple blocks. The mapping between a file and its block IDs is persisted in the NameNode.
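To see that mapping from the client side, you can ask the NameNode for a file's block locations; a minimal sketch, with the path passed on the command line as an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));   // e.g. /user/me/big.log
        // One BlockLocation per block; a file smaller than one block yields exactly one entry.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
    }
}

The command-line equivalent is hdfs fsck <path> -files -blocks -locations.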

According to Hadoop: The Definitive Guide:

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.

HDFS is designed to handle large files. If there are too many small files, the NameNode may become overloaded, since it holds the namespace for the entire file system in memory. Check this article on how to alleviate the problem with too many small files.

answered Sep 17 '22 by Praveen Sripati