
Large Block Size in HDFS! How is the unused space accounted for?


The block size in HDFS is large (64 MB or 128 MB) compared to the block size in traditional file systems. This is done to reduce seek time as a percentage of transfer time: improvements in transfer rate have far outpaced improvements in disk seek time, so a file system design aims to minimize the number of seeks relative to the amount of data transferred. But a large block size has the disadvantage of internal fragmentation, which is why traditional file systems keep their block sizes to a few KB, generally 4 KB or 8 KB.
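To make the seek-versus-transfer trade-off concrete, here is a rough Python sketch. The 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not measurements, and `seek_overhead` is a hypothetical helper name.

```python
KB = 1024
MB = 1024 * KB


def seek_overhead(block_size_bytes, seek_time_s=0.010, transfer_rate_bps=100 * MB):
    """Fraction of the time to read one block that is spent seeking.

    seek_time_s and transfer_rate_bps are assumed, illustrative disk figures.
    """
    transfer_time_s = block_size_bytes / transfer_rate_bps
    return seek_time_s / (seek_time_s + transfer_time_s)


for label, size in [("4 KB", 4 * KB), ("64 MB", 64 * MB), ("128 MB", 128 * MB)]:
    print(f"{label:>6}: {seek_overhead(size):.1%} of read time spent seeking")
```

Under these assumed numbers, a read of a few KB is dominated by the seek, while a read of a 64 MB or 128 MB block spends only a small fraction of its time seeking.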

I was reading Hadoop: The Definitive Guide and found a statement that a file smaller than the HDFS block size does not occupy a full block and is not charged for a full block's worth of space, but I couldn't understand how. Can somebody please shed some light on this?

Abhishek Jain asked Oct 22 '12 13:10




2 Answers

The block division in HDFS is a logical layer built on top of the physical blocks of the underlying file system (e.g. ext3 or FAT). The underlying file system is not physically divided into 64 MB or 128 MB blocks (or whatever the HDFS block size is). The block is just an abstraction used for the metadata stored in the NameNode. Because the NameNode has to hold all of this metadata in memory, there is a limit on the number of metadata entries, which explains the need for a large block size.

Therefore, three 8 MB files stored on HDFS logically occupy 3 blocks (3 metadata entries in the NameNode) but physically occupy only 8*3 = 24 MB of space in the underlying file system.
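The same arithmetic as a small Python sketch. `hdfs_usage` is a hypothetical helper, not part of any Hadoop API, and it deliberately ignores the small .meta checksum files and any rounding done by the local file system.

```python
MB = 1024 * 1024


def hdfs_usage(file_sizes, block_size=64 * MB, replication=1):
    """Return (logical block count, bytes physically stored on DataNodes).

    Simplified model: checksum (.meta) files and local file system
    rounding are not counted.
    """
    # Each file needs ceil(size / block_size) blocks, i.e. NameNode entries.
    logical_blocks = sum(-(-size // block_size) for size in file_sizes)
    # Physical usage follows the actual data size, not the block size.
    physical_bytes = sum(file_sizes) * replication
    return logical_blocks, physical_bytes


blocks, used = hdfs_usage([8 * MB, 8 * MB, 8 * MB])
print(blocks, used // MB)  # 3 24
```

If every file consumed a full 64 MB block on disk, the same three files would take 192 MB; in this model they take 24 MB.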

The large block size balances efficient use of storage space against the limit on the NameNode's memory.

Satbir answered Oct 22 '22 11:10


According to Hadoop: The Definitive Guide:

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage. When unqualified, the term “block” in this book refers to a block in HDFS.

Each block in HDFS is stored as a file on the DataNode's underlying OS file system (ext3, ext4, etc.), and the corresponding details are stored in the NameNode. Assume the file size is 200 MB and the block size is 64 MB. In this scenario the file has 4 blocks, which correspond to 4 files on the DataNode of 64 MB, 64 MB, 64 MB and 8 MB (assuming a replication factor of 1).
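The split described above can be sketched in Python as follows. `split_into_blocks` is a hypothetical helper that only models the sizes of the block files; it is not how HDFS itself is implemented.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes of the block files a file of file_size is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


sizes = split_into_blocks(200 * MB, 64 * MB)
print([s // MB for s in sizes])  # [64, 64, 64, 8]
```

The last entry is the point of the quote from the book: the final block file is 8 MB on disk, not 64 MB.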

Running ls -ltr on the DataNode shows the block files and their .meta (checksum) files:

-rw-rw-r-- 1 training training 11 Oct 21 15:27 blk_-7636754311343966967_1002.meta
-rw-rw-r-- 1 training training 4 Oct 21 15:27 blk_-7636754311343966967
-rw-rw-r-- 1 training training 99 Oct 21 15:29 blk_-2464541116551769838_1003.meta
-rw-rw-r-- 1 training training 11403 Oct 21 15:29 blk_-2464541116551769838
-rw-rw-r-- 1 training training 99 Oct 21 15:29 blk_-2951058074740783562_1004.meta
-rw-rw-r-- 1 training training 11544 Oct 21 15:29 blk_-2951058074740783562

Praveen Sripati answered Oct 22 '22 12:10