
Change File Split size in Hadoop

I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the processing time per file is huge. That is, a 64 MB file, which is the default split size for TextInputFormat, can take several hours to process.

What I need to do is reduce the split size so that I can utilize more nodes for a job.

So the question is, how can I split the files into, say, 10 KB splits? Do I need to implement my own InputFormat and RecordReader for this, or is there a parameter I can set? Thanks.

Asked Mar 13 '12 by Ahmedov

People also ask

What is split size in Hadoop?

In Hadoop, files are split into 128 MB blocks by default and stored in the Hadoop filesystem (HDFS).

How is input split size calculated in Hadoop?

Suppose there is 1 GB (1024 MB) of data to be stored and processed by Hadoop. While storing that 1 GB in HDFS, Hadoop splits the data into smaller chunks. Assuming the default split size of 128 MB, Hadoop stores the 1 GB of data in 8 blocks (1024 / 128 = 8).

What is split size?

Split size is a user-defined value, and you can choose your own split size based on the volume of data you are processing. The split is used mainly to control the number of mappers in a MapReduce program.


2 Answers

The parameter mapred.max.split.size, which can be set per job individually, is what you are looking for. Don't change dfs.block.size, because that is global for HDFS and can lead to problems.
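As a rough illustration, here is a minimal sketch of how this could be set in a job driver, assuming the newer org.apache.hadoop.mapreduce API; the class name, input path, and the 10 KB value are placeholders, not part of the answer. On recent Hadoop versions the equivalent property is mapreduce.input.fileinputformat.split.maxsize, and FileInputFormat.setMaxInputSplitSize(job, bytes) can be used instead of setting the property directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SmallSplitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 10 KB splits, set only for this job; dfs.block.size is left untouched.
        conf.setLong("mapred.max.split.size", 10 * 1024L);

        Job job = Job.getInstance(conf, "small-split-job");
        job.setJarByClass(SmallSplitJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper, reducer, output key/value classes, and output path as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}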

Answered by Brainlag

Hadoop: The Definitive Guide (page 203) explains that the maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block. The split size is calculated by the formula:

max(minimumSize, min(maximumSize, blockSize)) 

by default

minimumSize < blockSize < maximumSize 

so the split size is blockSize.

For example,

Minimum split size:  1
Maximum split size:  32 MB
Block size:          64 MB
Split size:          32 MB
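To make the arithmetic concrete, here is a small sketch that evaluates the formula above for these values; the class and method names are illustrative and not part of the Hadoop API.

public class SplitSizeFormula {
    // max(minimumSize, min(maximumSize, blockSize)) from the quoted passage
    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long min = 1L;                  // minimum split size
        long max = 32L * 1024 * 1024;   // 32 MB maximum split size
        long block = 64L * 1024 * 1024; // 64 MB block size
        // Prints 33554432 (32 MB): the maximum caps the split below the block size.
        System.out.println(computeSplitSize(min, max, block));
    }
}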

Hadoop works better with a small number of large files than with a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small ("small" meaning significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of tasks (one per file), each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into sixteen 64 MB blocks with 10,000 or so 100 KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent job with a single input file and 16 map tasks.

Answered by Ahmedov