Does a block in the Hadoop Distributed File System store multiple small files, or does a block store only one file?
If you're storing small files, then you probably have lots of them (otherwise you wouldn't turn to Hadoop), and the problem is that HDFS can't handle lots of files efficiently. Every file, directory, and block in HDFS is represented as an object in the namenode's memory, each of which occupies about 150 bytes as a rule of thumb.
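As a rough illustration of that rule of thumb (the figures below are illustrative assumptions, not exact namenode accounting): 10,000,000 small files that each fit in a single block cost roughly 10,000,000 x (1 file object + 1 block object) x 150 bytes, or about 3 GB of namenode heap, whereas the same amount of data packed into a few thousand large files made of 128 MB blocks would need only on the order of tens of megabytes of namenode memory for its metadata.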
You can write a MapReduce program that packs lots of small files into a single SequenceFile. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They also support block compression, which is usually the best option.
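As a minimal sketch of the packing step (a plain driver rather than a full MapReduce job; the class name, paths, and the choice of file name as key and raw bytes as value are assumptions for illustration), something along these lines writes each small file as one record of a block-compressed SequenceFile:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class PackSmallFiles {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // directory full of small files
        Path output = new Path(args[1]);     // resulting SequenceFile

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(output),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class),
            // BLOCK compression compresses batches of records together
            SequenceFile.Writer.compression(
                SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
          for (FileStatus status : fs.listStatus(inputDir)) {
            if (status.isFile()) {
              // Files are small, so reading each one fully into memory is fine.
              byte[] contents = new byte[(int) status.getLen()];
              try (FSDataInputStream in = fs.open(status.getPath())) {
                in.readFully(contents);
              }
              // key = original file name, value = raw file bytes
              writer.append(new Text(status.getPath().getName()),
                            new BytesWritable(contents));
            }
          }
        }
      }
    }

Because the output is block-compressed and splittable, downstream MapReduce jobs can read it with SequenceFileInputFormat and process it in parallel instead of spawning one map task per tiny file.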
The default data block size of HDFS is 128 MB. When the file size is significantly smaller than the block size, efficiency degrades.
Multiple files are not stored in a single block. A single file can, however, be stored in multiple blocks. The mapping between a file and its block IDs is persisted in the NameNode.
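To see that mapping in practice, a small sketch like the following (the class name and path argument are placeholders) asks the NameNode for a file's block locations through the public FileSystem API; a large file will report several blocks, each replicated on its own set of datanodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);        // e.g. a multi-hundred-MB file
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block of this single file
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.printf("offset=%d length=%d hosts=%s%n",
              block.getOffset(), block.getLength(),
              String.join(",", block.getHosts()));
        }
      }
    }

The hdfs fsck <path> -files -blocks -locations command reports the same file-to-block mapping from the shell.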
According to Hadoop: The Definitive Guide:
Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.
HDFS is designed to handle large files. If there are too many small files, the NameNode might get overloaded, since it stores the namespace for HDFS. Check this article on how to alleviate the problem with too many small files.