Hadoop block size and file size issue?

Tags:

hadoop

hdfs

This might seem like a silly question, but in Hadoop, suppose the block size is X (typically 64 or 128 MB) and a local file's size is Y, where Y is less than X. When I copy the file to HDFS, will it consume one full block, or will Hadoop create smaller blocks?

Slayer asked Jul 06 '12 20:07


1 Answer

One block is consumed by Hadoop, but that does not mean an entire block's worth of storage capacity is consumed.

The output when browsing HDFS from the web UI (columns: name, type, size, replication, block size, modification time) looks like this:

filename1   file    48.11 KB    3   128 MB  2012-04-24 18:36    
filename2   file    533.24 KB   3   128 MB  2012-04-24 18:36    
filename3   file    303.65 KB   3   128 MB  2012-04-24 18:37

You can see that each file is smaller than the 128 MB block size; these files are only a few hundred KB or less. HDFS capacity is consumed based on the actual file size, but one block is consumed per file.
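To make that accounting concrete, here is a minimal sketch using the Hadoop FileSystem API that lists a directory and prints each file's actual length next to its configured block size. The path /user/data and the class name are placeholders, not something from the original answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockUsage {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "/user/data" is a hypothetical directory; substitute your own path
        for (FileStatus st : fs.listStatus(new Path("/user/data"))) {
            long len = st.getLen();             // actual bytes stored: this is what consumes capacity
            long blockSize = st.getBlockSize(); // configured block size, e.g. 128 MB
            // number of block entries this file occupies (zero for an empty file)
            long blocks = (len == 0) ? 0 : (len + blockSize - 1) / blockSize;
            System.out.printf("%s: %d bytes, block size %d, blocks %d%n",
                    st.getPath().getName(), len, blockSize, blocks);
        }
    }
}

For each of the small files above this would report one block, even though only tens or hundreds of KB of disk space are actually used.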

The number of blocks available is limited by the capacity of the HDFS cluster, so you are wasting blocks: you will run out of them before utilizing all of the actual storage capacity. Remember that a Unix filesystem also has the concept of a block size, but there it is a very small number, around 512 bytes. This is inverted in HDFS, where the block size is kept large, around 64-128 MB.

The other issue is that when you run a map/reduce program, it will try to spawn one mapper per block, so in this case, processing these three small files may end up spawning three mappers. This wastes resources when the files are small, and it also adds latency, because each mapper takes time to spawn and then ends up working on a very small file. You have to compact the small files into files closer to the block size so that fewer mappers each work on a larger input.
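One common way to do that compaction, discussed in the Cloudera post linked below, is to pack the small files into a single SequenceFile (file name as key, file contents as value). The sketch below is only an illustration of that idea; the source and destination paths and the class name are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path("/user/data/small");      // hypothetical input directory
        Path packed = new Path("/user/data/packed.seq"); // hypothetical output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus st : fs.listStatus(srcDir)) {
                if (!st.isFile()) continue;
                // read the whole small file into memory (fine here because the files are tiny)
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (InputStream in = fs.open(st.getPath())) {
                    IOUtils.copyBytes(in, buf, conf, false);
                }
                // key = original file name, value = original file contents
                writer.append(new Text(st.getPath().getName()),
                              new BytesWritable(buf.toByteArray()));
            }
        }
    }
}

A map/reduce job can then read the single packed file and spawn mappers based on its blocks instead of one mapper per tiny file.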

Yet another issue with numerous small files is the load they put on the namenode, which keeps the metadata mapping for every file and block in main memory. With smaller files, you fill up this table faster, and more main memory is required as the metadata grows.
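As a rough back-of-the-envelope illustration of why this matters: the figure of roughly 150 bytes of namenode heap per file/block object is a rule of thumb cited in the Cloudera post linked below, and the ten million files are a made-up number, not a measurement:

public class NameNodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 10_000_000L;     // hypothetical: ten million small files
        long objectsPerFile = 2;      // roughly one file object plus one block object each
        long bytesPerObject = 150;    // rule-of-thumb namenode heap per object (see the Cloudera link)
        double heapGb = files * objectsPerFile * bytesPerObject / 1e9;
        System.out.printf("~%.1f GB of namenode heap for metadata alone%n", heapGb);
    }
}

That works out to about 3 GB of namenode heap just to hold the metadata, before any data is processed.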

Read the following for reference:

  1. http://www.cloudera.com/blog/2009/02/the-small-files-problem/
  2. http://www.ibm.com/developerworks/web/library/wa-introhdfs/
  3. A related discussion on Stack Overflow: Small files and HDFS blocks
pyfunc answered Oct 16 '22 15:10