HDFS block size Vs actual file size

Tags: filesize, hdfs

I know that HDFS stores data using the regular Linux file system on the data nodes. My HDFS block size is 128 MB. Let's say that I have 10 GB of disk space in my Hadoop cluster; that means HDFS initially has 80 blocks of available storage.

If I create a small file of, say, 12.8 MB, the number of available HDFS blocks will become 79. What happens if I create another small file of 12.8 MB? Will the number of available blocks stay at 79, or will it come down to 78? In the former case, HDFS basically recalculates the number of available blocks after each block allocation based on the free disk space, so the number of available blocks will become 78 only after more than 128 MB of disk space is consumed. Please clarify.

asked Feb 25 '13 08:02 by NPE


People also ask

Why is the block size very large in HDFS as compared to the traditional file system?

HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks. With many small blocks, a larger share of the time would be spent seeking (locating the start of each block) rather than transferring data.

Why is the HDFS data block size larger than the disk block size?

HDFS blocks are much larger than disk blocks in order to limit the cost of seeking. By making the block large enough, the time to transfer the data from disk becomes significantly longer than the time to seek to the start of the block.

Why is block size set to 128 MB in HDFS?

The default block size in HDFS is 128 MB (Hadoop 2.x) or 64 MB (Hadoop 1.x), which is much larger than the block size of a typical Linux file system (4 KB). The reason for this large block size is to minimize the cost of seeks and to reduce the metadata generated per block.

What if file size is less than block size?

Files smaller than the block size do not occupy a full block of storage. HDFS data blocks are large in order to reduce the cost of seeks and network traffic.


1 Answer

The best way to know is to try it; see my results below.

But before trying, my guess is that even if you can only allocate 80 full blocks in your configuration, you can allocate more than 80 non-empty files. This is because I think HDFS does not use a full block each time you allocate a non-empty file. Put another way, HDFS blocks are not a storage allocation unit, but a replication unit. I think the storage allocation unit of HDFS is the block of the underlying filesystem: if you use ext4 with a block size of 4 KB and you create a 1 KB file in a cluster with a replication factor of 3, you consume 3 times 4 KB = 12 KB of hard disk space.

Enough guessing and thinking, let's try it. My lab configuration is as follows:
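The estimate above can be sketched as a small shell function. This is only a rough model of my guess (space is consumed in whole blocks of the underlying filesystem, multiplied by the replication factor), not HDFS's actual accounting; it ignores the checksum metadata that data nodes keep next to each block, and the name `disk_usage_bytes` is made up for illustration.

```shell
#!/bin/sh
# Rough model (an assumption, not HDFS's real accounting):
# every replica is rounded up to a whole number of filesystem blocks.
# Usage: disk_usage_bytes FILE_SIZE FS_BLOCK_SIZE REPLICATION   (sizes in bytes)
disk_usage_bytes() {
    size=$1; fs_block=$2; replication=$3
    # ceil(size / fs_block) with integer arithmetic
    blocks=$(( (size + fs_block - 1) / fs_block ))
    echo $(( blocks * fs_block * replication ))
}

# 1 KB file, ext4 block size of 4 KB, replication factor of 3
disk_usage_bytes 1024 4096 3    # prints 12288, i.e. 12 KB
```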

  • hadoop version 1.0.4
  • 4 data nodes, each with a little less than 5.0G of available space, ext4 block size of 4K
  • block size of 64 MB, default replication of 1

After starting HDFS, I have the following NameNode summary:

  • 1 files and directories, 0 blocks = 1 total
  • DFS Used: 112 KB
  • DFS Remaining: 19.82 GB

Then I run the following commands:

  • hadoop fs -mkdir /test
  • for f in $(seq 1 10); do hadoop fs -copyFromLocal ./1K_file /test/$f; done
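For anyone who wants to reproduce this, here is a sketch of how the `1K_file` could be created before the copy loop and how usage could be checked afterwards. The `dd` line is just one way to get a 1 KB file (its contents do not matter), and the `hadoop` commands are guarded so that nothing is run against HDFS when `hadoop` is not on the PATH.

```shell
#!/bin/sh
# Before the copy loop: create a 1 KB local test file (1024 zero bytes).
dd if=/dev/zero of=./1K_file bs=1024 count=1 2>/dev/null

# After the copy loop: inspect what HDFS reports.
if command -v hadoop >/dev/null 2>&1; then
    # Per-file sizes as seen by HDFS
    hadoop fs -du /test
    # Files and the blocks backing them
    hadoop fsck /test -files -blocks
    # Cluster-wide "DFS Used" and "DFS Remaining"
    hadoop dfsadmin -report
fi
```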

With these results:

  • 12 files and directories, 10 blocks = 22 total
  • DFS Used: 122.15 KB
  • DFS Remaining: 19.82 GB

So the 10 files did not consume 10 times 64 MB: "DFS Used" grew by only about 10 KB, and "DFS Remaining" did not change.
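The numbers can be checked with a few lines of shell arithmetic. The second part applies the same reasoning to the figures in the question, under the simplifying assumptions of a replication factor of 1 and no filesystem or checksum overhead. Sizes there are in tenths of a MB so that 12.8 MB stays an integer, and `full_blocks_left` is a name invented for this sketch.

```shell
#!/bin/sh
# Part 1: cost if every file took a whole 64 MB block, versus what was observed.
expected_mb=$(( 10 * 64 ))
observed_kb=$(awk 'BEGIN { printf "%.2f", 122.15 - 112 }')   # growth of "DFS Used"
echo "if full blocks were allocated: ${expected_mb} MB"
echo "observed growth in DFS Used:   ${observed_kb} KB"

# Part 2: the question's scenario (assumes replication 1 and no overhead).
# Prints how many full 128 MB blocks still fit after N files of 12.8 MB.
full_blocks_left() {
    n_files=$1
    capacity=102400   # 10 GB = 10240 MB, in tenths of a MB
    block=1280        # 128 MB
    file=128          # 12.8 MB
    echo $(( (capacity - n_files * file) / block ))
}

full_blocks_left 0     # 80
full_blocks_left 1     # 79
full_blocks_left 2     # 79
full_blocks_left 11    # 78
```

Under this model, the number of full blocks that would still fit stays at 79 after the second 12.8 MB file and only drops to 78 with the eleventh, that is, once more than 128 MB has been consumed.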

answered Oct 30 '22 22:10 by jfg956