
How does HDFS calculate the available blocks?

Tags: hadoop, hdfs

Assume the block size is 128MB and the cluster has 10GB of capacity (so ~80 available blocks). Suppose I have created 10 small files that together take up 128MB on disk (block files, checksums, replication, ...) and occupy 10 HDFS blocks. If I want to add another small file to HDFS, what does HDFS use to calculate the number of available blocks: the blocks already used, or the actual disk usage?

80 blocks - 10 blocks = 70 available blocks or (10 GB - 128 MB)/128 MB = 79 available blocks?
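
To make the two candidate calculations concrete, here is a quick sketch (plain arithmetic, using the sizes from my example above):

```java
public class AvailableBlocks {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;              // 128 MB
        long clusterCapacity = 10L * 1024 * 1024 * 1024;  // 10 GB
        long diskUsed = 128L * 1024 * 1024;               // 10 small files, ~128 MB total on disk
        long hdfsBlocksUsed = 10;                         // one HDFS block per small file

        // Option 1: count in "block slots"
        long byBlocks = clusterCapacity / blockSize - hdfsBlocksUsed;    // 80 - 10 = 70

        // Option 2: count by actual disk usage
        long byDiskUsage = (clusterCapacity - diskUsed) / blockSize;     // 79

        System.out.println("By used blocks:      " + byBlocks);
        System.out.println("By actual disk used: " + byDiskUsage);
    }
}
```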

Thanks.

asked Feb 17 '23 by Bao Bui

1 Answer

Block size is just an indication to HDFS of how to split up and distribute files across the cluster - there is no physically reserved number of blocks in HDFS (you can even change the block size for each individual file if you wish).
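For instance, a minimal sketch of writing a file with its own block size through the standard Hadoop FileSystem API (the path, replication factor and 64MB block size here are just placeholder values):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // uses fs.defaultFS from core-site.xml

        // create(path, overwrite, bufferSize, replication, blockSize):
        // this file gets a 64 MB block size regardless of the cluster default
        Path file = new Path("/tmp/example.dat");   // placeholder path
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeUTF("hello");
        out.close();
    }
}
```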

For your example you also need to take the replication factor and checksum files into consideration, but essentially adding lots of small files (each smaller than the block size) does not mean that you have 'wasted' available blocks. The files take up only as much room as they need (remembering that replication multiplies the physical footprint required to store each file), and the number of 'available blocks' will be closer to your second calculation.
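One way to see this is to ask the cluster for its own figures - a rough sketch, assuming a reachable cluster and the Hadoop client libraries on the classpath. HDFS reports capacity, used and remaining space in bytes, so 'available blocks' is essentially remaining bytes divided by the block size:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;
import org.apache.hadoop.fs.Path;

public class RemainingSpace {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // HDFS tracks space in bytes, not in reserved block slots
        FsStatus status = fs.getStatus();
        long blockSize = fs.getDefaultBlockSize(new Path("/"));

        System.out.println("Capacity  (bytes): " + status.getCapacity());
        System.out.println("Used      (bytes): " + status.getUsed());
        System.out.println("Remaining (bytes): " + status.getRemaining());

        // Roughly how many more full-size blocks would fit
        System.out.println("~Blocks remaining: " + status.getRemaining() / blockSize);
    }
}
```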

A final note - having lots of small files means that your name node will require more memory to track them (block sizes, locations, etc.), and it's generally less efficient to process 128 x 1MB files than a single 128MB file (although that depends on how you're processing them).

answered Feb 24 '23 by Chris White