Change File Split size in Hadoop

Tags: java, distributed-computing, hadoop, mapreduce

I have a bunch of small files in an HDFS directory. Although the total volume of the files is relatively small, the processing time per file is huge. That is, a 64 MB file, which is the default split size for TextInputFormat, can take several hours to process.

What I need to do is reduce the split size so that I can use more nodes for a job.

So the question is: how can I split the files into chunks of, say, 10 KB? Do I need to implement my own InputFormat and RecordReader for this, or is there a parameter I can set? Thanks.

asked Mar 13 '12 04:03

Ahmedov

2 Answers

The parameter mapred.max.split.size, which can be set individually per job, is what you are looking for. Don't change dfs.block.size, because it is global for HDFS and changing it can lead to problems.

answered Sep 21 '22 23:09

Brainlag

From Hadoop: The Definitive Guide, page 203: "The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block. The split size is calculated by the formula:

max(minimumSize, min(maximumSize, blockSize))

by default

minimumSize < blockSize < maximumSize

so the split size is blockSize."

For example:

Minimum split size: 1

Maximum split size: 32 MB

Block size: 64 MB

Split size: 32 MB
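The formula quoted above can be checked with a few lines of plain Java. This is only a sketch of the arithmetic, not Hadoop's own code: the class name SplitSizeDemo and the method name computeSplitSize are chosen here for illustration, and the 64 MB block size is the value assumed in the question.

```java
public class SplitSizeDemo {

    // Applies the quoted formula: max(minimumSize, min(maximumSize, blockSize)).
    // All sizes are in bytes.
    public static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        final long KB = 1024L;
        final long MB = 1024L * KB;
        final long blockSize = 64 * MB; // block size assumed in the question

        // Defaults: minimum 1, maximum Long.MAX_VALUE, so the split size is the block size.
        System.out.println(computeSplitSize(1L, Long.MAX_VALUE, blockSize)); // 67108864

        // The example from the answer: a 32 MB maximum yields 32 MB splits.
        System.out.println(computeSplitSize(1L, 32 * MB, blockSize)); // 33554432

        // The 10 KB maximum the question asks about.
        System.out.println(computeSplitSize(1L, 10 * KB, blockSize)); // 10240
    }
}
```

The last call shows why lowering the maximum split size is enough for the question's case: once the maximum is below the block size, it alone determines the split size.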

Hadoop works better with a small number of large files than with a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into sixteen 64 MB blocks with 10,000 or so 100 KB files. The 10,000 files use one map task each, and the job can be tens or hundreds of times slower than the equivalent job with a single input file and 16 map tasks.
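The comparison above comes down to counting splits per file. The sketch below reproduces the two numbers (16 and 10,000) with plain Java. It is a simplification: it assumes every split belongs to exactly one file and uses a plain ceiling division, whereas the real FileInputFormat has additional details that are ignored here. The class and method names are hypothetical.

```java
public class MapTaskCountDemo {

    // Number of splits for one file, assuming positive sizes in bytes.
    // A split never spans two files, so even a tiny file needs its own split.
    public static long splitsForFile(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    // Total splits (and so map tasks) for fileCount files of equal size.
    public static long totalSplits(long fileCount, long fileSize, long splitSize) {
        return fileCount * splitsForFile(fileSize, splitSize);
    }

    public static void main(String[] args) {
        final long KB = 1024L;
        final long MB = 1024L * KB;
        final long GB = 1024L * MB;
        final long splitSize = 64 * MB; // split size equal to the block size

        // One 1 GB file: sixteen 64 MB splits.
        System.out.println(totalSplits(1, 1 * GB, splitSize)); // 16

        // 10,000 files of 100 KB each: one split per file.
        System.out.println(totalSplits(10_000, 100 * KB, splitSize)); // 10000
    }
}
```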

answered Sep 21 '22 23:09

Ahmedov
