If I need to do a sequential scan of thousands of (non-splittable) gzip files, with sizes varying between 200 and 500 MB, what is an appropriate block size for these files?
For the sake of this question, let's say that the processing done is very fast, so restarting a mapper is not costly, even for large block sizes.
My understanding is:
However, the gzipped files are of varying sizes. How is data stored if I choose a block size of ~500 MB (i.e. the maximum file size of all my input files)? Would it be better to pick a "very large" block size, like 2 GB? Is HDD capacity wasted excessively in either scenario?
I guess I'm really asking how files are actually stored and split across HDFS blocks, as well as trying to gain an understanding of best practice for non-splittable files.
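For context, this is roughly how I'd load the files if a larger per-file block size turns out to be the answer - just a sketch using the Java FileSystem API, with made-up paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutGzipsWithLargeBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side setting: files written with this config get 512 MB blocks,
        // so each 200-500 MB gzip should fit in a single block.
        // (Hadoop 2.x property name; Hadoop 1.x uses dfs.block.size.)
        conf.setLong("dfs.blocksize", 512L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Made-up paths, just for illustration.
        fs.copyFromLocalFile(new Path("/local/gzips/part-0001.gz"),
                             new Path("/data/gzips/part-0001.gz"));
        fs.close();
    }
}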
Update: A concrete example
Say I'm running an MR job on three 200 MB files, stored in one of the two layouts, case A or case B, described below.
If HDFS stores files as in case A, 3 mappers would be guaranteed to be able to process a "local" file each. However, if the files are stored as in case B, one mapper would need to fetch part of file 2 from another data node.
Given there's plenty of free blocks, does HDFS store files as illustrated by case A or case B?
The default block size in HDFS is 128 MB (Hadoop 2.x) and 64 MB (Hadoop 1.x), which is much larger than in a Linux filesystem, where the block size is typically 4 KB. The reason for such a large block size is to minimize seek cost and to reduce the metadata generated per block.
If I want to store a file (example.txt) of size 300 MB in HDFS with the 128 MB default, it will be stored across three blocks: 128 MB, 128 MB and 44 MB.
With the 64 MB default, a 200 MB file would be divided into 4 blocks of 64 MB, 64 MB, 64 MB and 8 MB.
A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
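The arithmetic above is easy to sanity-check; a throwaway sketch (nothing Hadoop-specific, just the division):

// Illustrative only: how a file of a given size is chopped into HDFS blocks
// (all blocks are full-sized except the last, which holds the remainder).
public class BlockLayout {
    static void show(long fileMb, long blockMb) {
        long fullBlocks = fileMb / blockMb;
        long lastBlock  = fileMb % blockMb;
        System.out.printf("%d MB file @ %d MB blocks -> %d x %d MB%s%n",
                fileMb, blockMb, fullBlocks, blockMb,
                lastBlock > 0 ? " + 1 x " + lastBlock + " MB" : "");
    }

    public static void main(String[] args) {
        show(300, 128);  // the 300 MB example above: 2 x 128 MB + 1 x 44 MB
        show(200, 64);   // the 200 MB example above: 3 x 64 MB + 1 x 8 MB
    }
}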
If you have non-splittable files then you are better off using larger block sizes - as large as the files themselves (or larger, it makes no difference).
If the block size is smaller than the overall file size then you run into the possibility that the blocks are not all on the same data node and you lose data locality. This isn't a problem with splittable files, as a map task will be created for each block.
As for an upper limit on block size, I know that for certain older versions of Hadoop the limit was 2 GB (above which the block contents were unobtainable) - see https://issues.apache.org/jira/browse/HDFS-96
There is no downside to storing smaller files with larger block sizes - to emphasize this point, consider a 1 MB file and a 2 GB file, each with a block size of 2 GB: each occupies exactly one block, and the 1 MB file takes up only 1 MB of physical storage, not 2 GB.
So other than the required physical storage, there is no downside for the Name node block table (both files have a single entry in the block table).
The only possible downside is the time it takes to replicate a smaller versus a larger block, but on the flip side, if a data node is lost from the cluster, tasking 2000 x 1 MB blocks to replicate is slower than replicating a single 2 GB block.
Update - a worked example
Seeing as this is causing some confusion, here are some worked examples:
Say we have a system with a 300 MB HDFS block size, and to make things simpler we have a pseudo-cluster with only one data node.
If you want to store a 1100 MB file, then HDFS will break that file up into at most 300 MB blocks and store them on the data node in special block index files. If you were to go to the data node and look at where it stores the indexed block files on physical disk, you may see something like this:
/local/path/to/datanode/storage/0/blk_000000000000001 300 MB
/local/path/to/datanode/storage/0/blk_000000000000002 300 MB
/local/path/to/datanode/storage/0/blk_000000000000003 300 MB
/local/path/to/datanode/storage/0/blk_000000000000004 200 MB
Note that the file isn't exactly divisible by 300 MB, so the final block of the file is sized as the file size modulo the block size (1100 MB mod 300 MB = 200 MB).
Now if we repeat the same exercise with a file smaller than the block size, say 1 MB, and look at how it would be stored on the data node:
/local/path/to/datanode/storage/0/blk_000000000000005 1 MB
Again note that the actual file stored on the data node is 1 MB, NOT a 300 MB file with 299 MB of zero padding (which I think is where the confusion is coming from).
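If you'd rather not poke around on the data node's disk, you can ask for the same information through the FileSystem API - a quick sketch (the output format is mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path(args[0]));

        System.out.println("block size: " + stat.getBlockSize()
                + " bytes, file length: " + stat.getLen() + " bytes");

        // One BlockLocation per block, listing the data nodes holding replicas.
        for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}

Run it with the HDFS path of one of your files as the only argument and you'll see one line per block, including which data nodes hold it.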
Now where the block size does play a factor in efficiency is in the Name Node. For the above two examples, the name node needs to maintain a map of the file names, to block names and data node locations (as well as the total file size and block size):
filename     index      datanode
-------------------------------------------
fileA.txt    blk_01     datanode1
fileA.txt    blk_02     datanode1
fileA.txt    blk_03     datanode1
fileA.txt    blk_04     datanode1
-------------------------------------------
fileB.txt    blk_05     datanode1
You can see that if you were to use a block size of 1 MB for fileA.txt, you'd need 1100 entries in the above map rather than 4 (which would require more memory in the name node). Also pulling back all the blocks would be more expensive as you'd be making 1100 RPC calls to datanode1 rather than 4.
I'll attempt to highlight, by way of example, how block splits differ depending on file size. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation since it means that the tasktracker will have to fetch 16 blocks of data most of which will not be local to the tasktracker.
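For gzip the standard input formats already refuse to split (the codec isn't splittable), but if you want to force whole-file-per-mapper behaviour explicitly for some other non-splittable format, the usual pattern is to override isSplitable - a minimal sketch against the newer mapreduce API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One split (and therefore one mapper) per input file, no matter how many
// HDFS blocks the file occupies.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}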
Finally, the relationships of the block, split and file can be summarized as follows:
                 block boundary
|BLOCK           |BLOCK           |BLOCK           |BLOCK   ||||||||
|FILE------------|----------------|----------------|--------|
|SPLIT             |                |                |      |
The split can extend beyond the block because the split depends on the InputFormat class's definition of how to split the file, which may not coincide with the block size, so the split extends beyond the block boundary to include the record seek points within the source.
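For completeness, the split size that the stock FileInputFormat uses is derived from the block size roughly like this (a paraphrase of Hadoop's computeSplitSize, where minSize and maxSize come from mapreduce.input.fileinputformat.split.minsize / .maxsize):

// With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) the split size
// equals the block size, so splits normally line up with block boundaries;
// pushing minSize above the block size is what makes a split span blocks.
static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}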