Large Block Size in HDFS! How is the unused space accounted for?


We all know that the block size in HDFS is large (64 MB or 128 MB) compared to the block size in traditional file systems. This reduces seek time as a fraction of transfer time: transfer rates have improved far more than disk seek times, so a file system design tries to minimize the number of seeks relative to the amount of data transferred. But a large block size brings the disadvantage of internal fragmentation, which is why traditional file system block sizes are kept small, on the order of a few KB (generally 4 KB or 8 KB).

I was going through the book Hadoop: The Definitive Guide and found it stated that a file smaller than the HDFS block size does not occupy a full block and is not charged for the full block's space, but I couldn't understand how. Can somebody please shed some light on this?


asked Oct 22 '12 13:10

Abhishek Jain

2 Answers

The block division in HDFS is a logical layer built over the physical blocks of the underlying file system (e.g. ext3/FAT). The file system is not physically divided into blocks of 64 MB, 128 MB, or whatever the block size may be. A block is just an abstraction used to store metadata in the NameNode. Since the NameNode has to hold the entire metadata in memory, there is a limit on the number of metadata entries, which explains the need for a large block size.

Therefore, three 8 MB files stored on HDFS logically occupy 3 blocks (3 metadata entries in the NameNode) but physically occupy only 8*3 = 24 MB of space in the underlying file system.
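The arithmetic above can be sketched in a few lines of Python. This is only an illustration: `hdfs_usage` is a hypothetical helper, not part of any Hadoop API, and it assumes a replication factor of 1 and ignores the small `.meta` checksum files.

```python
MB = 1024 * 1024

def hdfs_usage(file_sizes, block_size):
    """Return (logical_blocks, physical_bytes) for a list of file sizes.

    Hypothetical helper for illustration only; assumes replication 1
    and ignores checksum (.meta) overhead.
    """
    # Each file needs ceil(size / block_size) block entries (integer ceiling).
    logical_blocks = sum(-(-size // block_size) for size in file_sizes)
    # On disk, a block file is only as long as the data it holds.
    physical_bytes = sum(file_sizes)
    return logical_blocks, physical_bytes

blocks, used = hdfs_usage([8 * MB] * 3, block_size=64 * MB)
print(blocks, used // MB)  # 3 block entries, 24 MB on disk
```

If every block were charged at the full block size, the same three files would cost 3 * 64 = 192 MB, which is exactly the waste HDFS avoids.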

The large block size thus allows storage space to be used properly while respecting the limit on the NameNode's memory.
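To get a feel for the NameNode memory argument, here is a rough sketch that counts how many block entries would be needed to track 1 TB of data at different block sizes. The function name `block_entries` is made up for this example, and it assumes the data sits in large files that fill their blocks, so treat the numbers as an order-of-magnitude illustration only.

```python
KB = 1024
MB = 1024 ** 2
TB = 1024 ** 4

def block_entries(total_bytes, block_size):
    """Number of block entries needed to cover total_bytes (integer ceiling).

    Hypothetical helper; assumes the data is stored in large files.
    """
    return -(-total_bytes // block_size)

for label, size in [("4 KB", 4 * KB), ("64 MB", 64 * MB), ("128 MB", 128 * MB)]:
    print(label, block_entries(TB, size))
```

With 4 KB blocks the NameNode would have to track hundreds of millions of entries per terabyte; with 128 MB blocks it is a few thousand.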


answered Oct 22 '22 11:10

Satbir

According to Hadoop: The Definitive Guide:

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage. When unqualified, the term “block” in this book refers to a block in HDFS.

Each block in HDFS is stored as a file on the DataNode, on the underlying OS file system (ext3, ext4, etc.), and the corresponding details are stored in the NameNode. Let's assume the file size is 200 MB and the block size is 64 MB. In this scenario there will be 4 blocks for the file, which correspond to 4 files on the DataNode of 64 MB, 64 MB, 64 MB, and 8 MB (assuming a replication factor of 1).

An ls -ltr on the DataNode shows the block details; note that each blk_ file is only as large as the data it holds:
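The split described above can be reproduced with a small sketch. `split_into_blocks` is a hypothetical function written for this answer, not an HDFS API; sizes are in MB and replication is assumed to be 1.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the block files a file of file_size would produce.

    Hypothetical illustration; units are whatever the caller uses (MB here).
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block file holds only the leftover data, not a full block.
        sizes.append(remainder)
    return sizes

print(split_into_blocks(200, 64))  # [64, 64, 64, 8]
```

The last entry is the point of the question: the final block takes 8 MB on the DataNode's disk, not 64 MB.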

-rw-rw-r-- 1 training training 11 Oct 21 15:27 blk_-7636754311343966967_1002.meta
-rw-rw-r-- 1 training training 4 Oct 21 15:27 blk_-7636754311343966967
-rw-rw-r-- 1 training training 99 Oct 21 15:29 blk_-2464541116551769838_1003.meta
-rw-rw-r-- 1 training training 11403 Oct 21 15:29 blk_-2464541116551769838
-rw-rw-r-- 1 training training 99 Oct 21 15:29 blk_-2951058074740783562_1004.meta
-rw-rw-r-- 1 training training 11544 Oct 21 15:29 blk_-2951058074740783562

answered Oct 22 '22 12:10

Praveen Sripati
