If I need to do a sequential scan of thousands of (non-splittable) gzip files, with sizes varying between 200 and 500 MB, what is an appropriate block size for these files?
For the sake of this question, let's say that the processing done is very fast, so restarting a mapper is not costly, even for large block sizes.
My understanding is:
However, the gzipped files are of varying sizes. How is data stored if I choose a block size of ~500 MB (i.e. the maximum file size of all my input files)? Would it be better to pick a "very large" block size, like 2 GB? Is HDD capacity wasted excessively in either scenario?
I guess I'm really asking how files are actually stored and split across HDFS blocks, as well as trying to gain an understanding of best practice for non-splittable files.
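For context, this is roughly how I'd load the files if a larger per-file block size turns out to be the answer - just a sketch using the Java FileSystem API, with made-up paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutGzipsWithLargeBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side setting: files written with this config get 512 MB blocks,
        // so each 200-500 MB gzip should fit in a single block.
        // (Hadoop 2.x property name; Hadoop 1.x uses dfs.block.size.)
        conf.setLong("dfs.blocksize", 512L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Made-up paths, just for illustration.
        fs.copyFromLocalFile(new Path("/local/gzips/part-0001.gz"),
                             new Path("/data/gzips/part-0001.gz"));
        fs.close();
    }
}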
Update: A concrete example
Say I'm running an MR job on three 200 MB files, stored in one of the two layouts, case A or case B, described below.
If HDFS stores files as in case A, 3 mappers would be guaranteed to be able to process a "local" file each. However, if the files are stored as in case B, one mapper would need to fetch part of file 2 from another data node.
Given there's plenty of free blocks, does HDFS store files as illustrated by case A or case B?
The default block size in HDFS is 128 MB (Hadoop 2.x) and 64 MB (Hadoop 1.x), which is much larger than in a Linux filesystem, where the block size is typically 4 KB. The reason for such a large block size is to minimize seek cost and to reduce the metadata generated per block.
If I want to store a file (example.txt) of size 300 MB in HDFS with the 128 MB default, it will be stored across three blocks: 128 MB, 128 MB and 44 MB.
With the 64 MB default, a 200 MB file would be divided into 4 blocks of 64 MB, 64 MB, 64 MB and 8 MB.
A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
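The arithmetic above is easy to sanity-check; a throwaway sketch (nothing Hadoop-specific, just the division):

// Illustrative only: how a file of a given size is chopped into HDFS blocks
// (all blocks are full-sized except the last, which holds the remainder).
public class BlockLayout {
    static void show(long fileMb, long blockMb) {
        long fullBlocks = fileMb / blockMb;
        long lastBlock  = fileMb % blockMb;
        System.out.printf("%d MB file @ %d MB blocks -> %d x %d MB%s%n",
                fileMb, blockMb, fullBlocks, blockMb,
                lastBlock > 0 ? " + 1 x " + lastBlock + " MB" : "");
    }

    public static void main(String[] args) {
        show(300, 128);  // the 300 MB example above: 2 x 128 MB + 1 x 44 MB
        show(200, 64);   // the 200 MB example above: 3 x 64 MB + 1 x 8 MB
    }
}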
If you have non-splittable files then you are better off using larger block sizes - as large as the files themselves (or larger, it makes no difference).
If the block size is smaller than the overall file size then you run into the possibility that the blocks are not all on the same data node and you lose data locality. This isn't a problem with splittable files, as a map task will be created for each block.
As for an upper limit on block size, I know that for certain older versions of Hadoop the limit was 2 GB (above which the block contents were unobtainable) - see https://issues.apache.org/jira/browse/HDFS-96
There is no downside to storing smaller files with larger block sizes - to emphasize this point, consider a 1 MB file and a 2 GB file, each with a block size of 2 GB: each occupies exactly one block, and the 1 MB file takes up only 1 MB of physical storage, not 2 GB.
So other than the required physical storage, there is no downside for the Name node block table (both files have a single entry in the block table).
The only possible downside is the time it takes to replicate a smaller versus a larger block, but on the flip side, if a data node is lost from the cluster, tasking 2000 x 1 MB blocks to replicate is slower than replicating a single 2 GB block.
Update - a worked example
Seeing as this is causing some confusion, here are some worked examples:
Say we have a system with a 300 MB HDFS block size, and to make things simpler we have a pseudo-cluster with only one data node.
If you want to store a 1100 MB file, then HDFS will break that file up into at most 300 MB blocks and store them on the data node in special block index files. If you were to go to the data node and look at where it stores the indexed block files on physical disk, you may see something like this:
/local/path/to/datanode/storage/0/blk_000000000000001 300 MB
/local/path/to/datanode/storage/0/blk_000000000000002 300 MB
/local/path/to/datanode/storage/0/blk_000000000000003 300 MB
/local/path/to/datanode/storage/0/blk_000000000000004 200 MB
Note that the file isn't exactly divisible by 300 MB, so the final block of the file is sized as the file size modulo the block size (1100 MB mod 300 MB = 200 MB).
Now if we repeat the same exercise with a file smaller than the block size, say 1 MB, and look at how it would be stored on the data node:
/local/path/to/datanode/storage/0/blk_000000000000005 1 MB
Again note that the actual file stored on the data node is 1 MB, NOT a 300 MB file with 299 MB of zero padding (which I think is where the confusion is coming from).
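If you'd rather not poke around on the data node's disk, you can ask for the same information through the FileSystem API - a quick sketch (the output format is mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path(args[0]));

        System.out.println("block size: " + stat.getBlockSize()
                + " bytes, file length: " + stat.getLen() + " bytes");

        // One BlockLocation per block, listing the data nodes holding replicas.
        for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}

Run it with the HDFS path of one of your files as the only argument and you'll see one line per block, including which data nodes hold it.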
Now where the block size does play a factor in efficiency is in the Name Node. For the above two examples, the name node needs to maintain a map of the file names, to block names and data node locations (as well as the total file size and block size):
filename     index      datanode
-------------------------------------------
fileA.txt    blk_01     datanode1
fileA.txt    blk_02     datanode1
fileA.txt    blk_03     datanode1
fileA.txt    blk_04     datanode1
-------------------------------------------
fileB.txt    blk_05     datanode1
You can see that if you were to use a block size of 1 MB for fileA.txt, you'd need 1100 entries in the above map rather than 4 (which would require more memory in the name node). Also pulling back all the blocks would be more expensive as you'd be making 1100 RPC calls to datanode1 rather than 4.
I'll attempt to highlight, by way of example, how block splits differ depending on file size. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation since it means that the tasktracker will have to fetch 16 blocks of data most of which will not be local to the tasktracker.
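For gzip the standard input formats already refuse to split (the codec isn't splittable), but if you want to force whole-file-per-mapper behaviour explicitly for some other non-splittable format, the usual pattern is to override isSplitable - a minimal sketch against the newer mapreduce API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One split (and therefore one mapper) per input file, no matter how many
// HDFS blocks the file occupies.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}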
Finally, the relationships of the block, split and file can be summarized as follows:
                 block boundary
|BLOCK           |BLOCK           |BLOCK           |BLOCK   ||||||||
|FILE------------|----------------|----------------|--------|
|SPLIT             |                |                |      |
The split can extend beyond the block because the split depends on the InputFormat class's definition of how to split the file, which may not coincide with the block size, so the split extends beyond the block boundary to include the record seek points within the source.
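For completeness, the split size that the stock FileInputFormat uses is derived from the block size roughly like this (a paraphrase of Hadoop's computeSplitSize, where minSize and maxSize come from mapreduce.input.fileinputformat.split.minsize / .maxsize):

// With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) the split size
// equals the block size, so splits normally line up with block boundaries;
// pushing minSize above the block size is what makes a split span blocks.
static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}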