
Hadoop - appropriate block size for unsplittable files of varying sizes (200-500 MB)

Tags:

hadoop

hdfs

If I need to do a sequential scan of thousands of (non-splittable) gzip files with sizes varying between 200 and 500 MB, what is an appropriate block size for these files?

For the sake of this question, let's say that the processing done is very fast, so restarting a mapper is not costly, even for large block sizes.

My understanding is:

  1. There's hardly an upper limit on block size, as there are "plenty of files" to keep an appropriate number of mappers busy for the size of my cluster.
  2. To ensure data-locality, I want each gzip file to be in 1 block.

However, the gzipped files are of varying sizes. How is data stored if I choose a block size of ~500 MB (e.g. the maximum file size among all my inputs)? Would it be better to pick a "very large" block size, like 2 GB? Is HDD capacity wasted excessively in either scenario?

I guess I'm really asking how files are actually stored and split across HDFS blocks - as well as trying to gain an understanding of best practice for non-splittable files.

Update: A concrete example

Say I'm running an MR job on three 200 MB files, stored as in the following illustration.

If HDFS stores files as in case A, 3 mappers would be guaranteed to be able to process a "local" file each. However, if the files are stored as in case B, one mapper would need to fetch part of file 2 from another data node.

Given there's plenty of free blocks, does HDFS store files as illustrated by case A or case B?

(Illustration: "HDFS strategies" - block placement in case A vs. case B)

asked Jun 21 '13 by jkgeyti

People also ask

Why is block size set to 128 MB in Hadoop HDFS?

The default size of a block in HDFS is 128 MB (Hadoop 2.x) and 64 MB (Hadoop 1.x), which is much larger than the 4 KB block size of a typical Linux file system. The reason for having such a large block size is to minimize the cost of seeks and to reduce the metadata generated per block.

How many blocks will be created for a file that is 300 MB in HDFS?

If I want to store a file (example.txt) of size 300 MB in HDFS, it will be stored across three blocks: 128 MB + 128 MB + 44 MB with the default 128 MB block size.

Across how many blocks will a 200 MB file be stored in HDFS, assuming the default block size for Hadoop v2+?

The default block size in Hadoop v2+ is 128 MB, so a 200 MB file will be divided into two blocks: one of 128 MB and one of 72 MB.

What is the block size of Hadoop?

A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.


2 Answers

If you have non-splittable files then you are better off using larger block sizes - as large as the files themselves (or larger, it makes no difference).

If the block size is smaller than the overall file size then you run the risk that the blocks are not all on the same data node and you lose data locality. This isn't a problem with splittable files, as a map task will be created for each block.
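To make that concrete, here is a minimal Java sketch of uploading a file with a per-file block size larger than the file itself, using the FileSystem.create overload that takes an explicit block size. The paths, the 512 MB figure and the replication factor are placeholder assumptions, not values from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class UploadWithLargeBlock {
    public static void main(String[] args) throws Exception {
        // Assumes the cluster's core-site.xml/hdfs-site.xml are on the classpath;
        // otherwise FileSystem.get() falls back to the local file system.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dst = new Path("/data/input/fileA.gz");   // hypothetical target path
        long blockSize = 512L * 1024 * 1024;           // larger than the biggest ~500 MB input
        short replication = 3;
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        // create(path, overwrite, bufferSize, replication, blockSize) writes this
        // file with the given block size regardless of the cluster-wide default.
        try (InputStream in = Files.newInputStream(Paths.get("fileA.gz"));  // hypothetical local source
             FSDataOutputStream out = fs.create(dst, true, bufferSize, replication, blockSize)) {
            IOUtils.copyBytes(in, out, bufferSize);
        }
    }
}

The same effect is available per upload from the command line by overriding the dfs.blocksize property for that copy, so the cluster-wide default does not need to change.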

As for an upper limit on block size, I know that for certain older versions of Hadoop the limit was 2 GB (above which the block contents were unobtainable) - see https://issues.apache.org/jira/browse/HDFS-96

There is no downside to storing smaller files with larger block sizes - to emphasize this point, consider a 1 MB and a 2 GB file, each with a block size of 2 GB:

  • 1 MB - 1 block, single entry in the Name Node, 1 MB physically stored on each data node replica
  • 2 GB - 1 block, single entry in the Name node, 2 GB physically stored on each data node replica

So other than the required physical storage, there is no downside for the Name Node block table (both files have a single entry in it).

The only possible downside is the time it takes to replicate a smaller versus a larger block; on the flip side, if a data node is lost from the cluster, scheduling 2000 x 1 MB blocks for replication is slower than replicating a single 2 GB block.

Update - a worked example

Seeing as this is causing some confusion, here are some worked examples:

Say we have a system with a 300 MB HDFS block size, and to make things simpler, a pseudo-cluster with only one data node.

If you want to store an 1100 MB file, then HDFS will break that file up into blocks of at most 300 MB and store them on the data node in special block-indexed files. If you were to go to the data node and look at where it stores the indexed block files on physical disk, you might see something like this:

/local/path/to/datanode/storage/0/blk_000000000000001  300 MB
/local/path/to/datanode/storage/0/blk_000000000000002  300 MB
/local/path/to/datanode/storage/0/blk_000000000000003  300 MB
/local/path/to/datanode/storage/0/blk_000000000000004  200 MB

Note that the file isn't exactly divisible by 300 MB, so the final block of the file is sized as the remainder: 1100 mod 300 = 200 MB.
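The same arithmetic in a few lines of plain Java, using only the numbers from this example (no Hadoop dependency needed):

public class BlockMath {
    public static void main(String[] args) {
        long fileSize  = 1100L * 1024 * 1024;   // 1100 MB file
        long blockSize = 300L  * 1024 * 1024;   // 300 MB block size

        long fullBlocks  = fileSize / blockSize;                    // 3 full blocks
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024);  // 200 MB remainder
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);

        // prints: 4 blocks, last block 200 MB
        System.out.println(totalBlocks + " blocks, last block " + lastBlockMb + " MB");
    }
}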

Now if we repeat the same exercise with a file smaller than the block size, say 1 MB, and look at how it would be stored on the data node:

/local/path/to/datanode/storage/0/blk_000000000000005  1 MB

Again, note that the actual file stored on the data node is 1 MB, NOT a 1 MB file padded out to 300 MB with 299 MB of zeros (which I think is where the confusion is coming from).

Now, where the block size does factor into efficiency is in the Name Node. For the above two examples, the Name Node needs to maintain a map from file names to block names and data node locations (as well as the total file size and block size):

filename     index     datanode
-------------------------------------------
fileA.txt    blk_01    datanode1
fileA.txt    blk_02    datanode1
fileA.txt    blk_03    datanode1
fileA.txt    blk_04    datanode1
-------------------------------------------
fileB.txt    blk_05    datanode1

You can see that if you were to use a block size of 1 MB for fileA.txt, you'd need 1100 entries in the above map rather than 4 (which would require more memory in the Name Node). Also, pulling back all the blocks would be more expensive, as you'd be making 1100 RPC calls to datanode1 rather than 4.
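If you want to see that map for a real file, here is a hedged sketch against the public FileSystem API (the path is hypothetical); it prints each block's offset, length and hosting data nodes, plus the block size the file was written with:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        FileStatus status = fs.getFileStatus(new Path("/data/fileA.txt"));  // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        System.out.println("block size used: " + status.getBlockSize());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        // A file written with a block size >= its length reports exactly one entry here.
    }
}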

answered by Chris White

I'll attempt to illustrate, by way of example, how block splits differ with file size. In HDFS you have:

Splittable FileA, size 1 GB
dfs.block.size=67108864 (~64 MB)

MapRed job against this file:

16 splits and in turn 16 mappers.

Let's look at this scenario with a compressed (non-splittable) file:

Non-splittable FileA.gzip, size 1 GB
dfs.block.size=67108864 (~64 MB)

MapRed job against this file:

16 Blocks will converge on 1 mapper.

It's best to proactively avoid this situation, since it means the tasktracker will have to fetch 16 blocks of data, most of which will not be local to it.
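If you want to check up front whether an input will hit this, the sketch below (file names are illustrative; codecs are matched by file extension) applies the same codec test that Hadoop's file-based input formats use to decide splittability: no codec, or a codec implementing SplittableCompressionCodec, means the file can be split, and GzipCodec does not implement it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());

        for (String name : new String[]{"FileA.txt", "FileA.gz", "FileA.bz2"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));   // matched by extension
            boolean splittable = codec == null || codec instanceof SplittableCompressionCodec;
            System.out.printf("%-12s codec=%-12s splittable=%b%n",
                    name,
                    codec == null ? "none" : codec.getClass().getSimpleName(),
                    splittable);
        }
        // .gz maps to GzipCodec (not splittable) -> one mapper for the whole 1 GB file;
        // .bz2 maps to BZip2Codec, which is splittable.
    }
}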

Finally, the relationships of the block, split and file can be summarized as follows:

                                                             block boundary
|BLOCK           |    BLOCK       |   BLOCK        |   BLOCK ||||||||
|FILE------------|----------------|----------------|---------|
|SPLIT            |                |                |        |

The split can extend beyond the block because the split depends on the InputFormat class's definition of how to divide the file, which may not coincide with the block size; the split therefore extends past the block boundary to include the seek points it needs within the source.
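For intuition only, here is a simplified sketch of that split computation. The real FileInputFormat also honours the configured minimum/maximum split sizes and attaches block locations to each split; the 10% "slop" constant mirrors the real code, everything else here is illustrative:

import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // A split may be up to 10% larger than the target size before the tail
    // is broken off into its own split (the same slack the real code allows).
    static final double SPLIT_SLOP = 1.1;

    // Returns {offset, length} pairs for one file of the given size.
    static List<long[]> splits(long fileSize, long splitSize) {
        List<long[]> result = new ArrayList<>();
        long remaining = fileSize;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            result.add(new long[]{fileSize - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining > 0) {
            result.add(new long[]{fileSize - remaining, remaining});
        }
        return result;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;   // the 1 GB FileA from the example above
        long block = 67108864L;             // dfs.block.size (~64 MB)
        System.out.println(splits(oneGb, block).size() + " splits");  // prints: 16 splits
    }
}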

answered by Engineiro