Hadoop MapReduce: Appropriate input files size?

I have data sets on the order of hundreds of gigabytes, or even single- to double-digit terabytes. The input is therefore a list of files, each around 10 GB in size. My MapReduce job in Hadoop processes all of these files and then produces only one output file (with the aggregated information).

My questions are:

  1. What is the appropriate input file size for tuning Apache Hadoop/MapReduce? I hear that larger files are preferred over smaller ones. Any ideas? The only thing I know for sure is that Hadoop reads data in blocks of 64 MB each by default, so presumably it would be good for the file size to be a multiple of 64 MB.

  2. At the moment, my application writes its output into a single file, which is then of course hundreds of gigabytes in size. I am wondering how I can partition the output efficiently. Of course I could just use some Unix tools to do this, but is it preferable to do it directly in Hadoop?

Thanks for your comments!

P.S.: I am not compressing the files. The file format of the input files is text/csv.

Asked Jun 13 '12 by Bob


3 Answers

If you're not compressing the files, then Hadoop will process your large files (say 10 GB each) with a number of mappers related to the block size of the file.

Say your block size is 64 MB; then you will have ~160 mappers processing this 10 GB file (160 * 64 MB ≈ 10 GB). Depending on how CPU-intensive your mapper logic is, this might be an acceptable block size, but if you find that your mappers are executing in sub-minute times, then you might want to increase the work done by each mapper (by increasing the block size to 128, 256, or 512 MB; the actual size depends on how you intend to process the data).

A larger block size will reduce the number of mappers used to process the 10 GB file. You can of course increase the minimum split size used by the TextInputFormat, but then you'll most probably run into lower data locality, as each mapper may be processing two or more blocks, which may not all reside locally on that node.
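
For illustration only (this is not from the original answer): a minimal driver sketch assuming the Hadoop 2.x-style mapreduce API, with a made-up class name, job name and a 256 MB minimum split size. Raising the minimum split size hands each mapper several 64 MB blocks at once:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AggregationDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "aggregate-inputs");

            // Ask for splits of at least 256 MB (four 64 MB blocks), so fewer,
            // longer-running mappers process the same 10 GB file. This trades
            // away some data locality, as noted above.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // ... set mapper, reducer, and output key/value classes as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }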

As for the output, this again depends on what your processing logic is doing: can you partition just by introducing more reducers? This will create more output files, but what partitioning logic do you require for these files? (By default they will be hash-partitioned by your key.)
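
As a sketch of those two knobs, continuing the hypothetical driver above (the reducer count of 16 is arbitrary, and HashPartitioner is already the default; it is set explicitly here only to make the routing visible):

    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    // More reducers means more output files: with 16 reduce tasks the job
    // writes part-r-00000 ... part-r-00015 instead of a single file.
    job.setNumReduceTasks(16);

    // Default behaviour, shown explicitly: records are routed to reducers by
    // hashing the map output key, so equal keys always meet at the same reducer.
    job.setPartitionerClass(HashPartitioner.class);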

Answered by Chris White


Size of the input files:

One way to tune this is to look at how fast your map tasks are completing. Each map task takes one file as input, and if they are completing in under 30-40 seconds, then you should consider increasing the size of each file so that each mapper has more work to do. This is because a map task takes about 30 seconds to initialize before it does any actual work.

It also depends on how many map tasks your cluster can run at one time. You can try to tune your file and block sizes so that you take advantage of as many map tasks as possible. See this blog post for more ideas: http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/

Size of output files:

The simple way to do this is to specify more than one reducer (each reducer will produce a single output file). If you want to partition your results by some key (say, year-month), you can include that in the output key of your map task, and all records for that key will be sent to the same reducer. Then you just need to check each output file to see which year-month keys it contains.
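
A sketch of that idea, assuming purely for illustration that each CSV line begins with an ISO date such as 2012-06-13 (the class name and field layout are made up):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: the year-month prefix of the date field becomes the
    // map output key, so every record for a given month is sent to the same
    // reducer under the default hash partitioning.
    public class YearMonthMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text yearMonth = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", 2);
            if (fields[0].length() >= 7) {
                yearMonth.set(fields[0].substring(0, 7)); // e.g. "2012-06"
                context.write(yearMonth, line);
            }
        }
    }

Note that a single reducer can still receive several distinct year-month keys, so one part file may contain more than one month; that is why you need to check each file afterwards, as described above.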

Compression:

I recommend that you look at compressing your files. Doing this will make the input files "bigger" since each one will contain more data for a single map task to operate on. It will also reduce the amount of disk space you use in your cluster. If anything, it might also increase the performance of MapReduce on your cluster, because less disk I/O and network traffic will occur from reading and moving the files around.

Also, compress the intermediate output of your map task (output from the map task before it goes to the reducer). It will increase performance in a similar manner. This is done by setting mapred.compress.map.output=true.
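
A minimal sketch of that setting, using the old-style property names the answer refers to; the choice of SnappyCodec is an assumption (it needs the native Snappy libraries on the task nodes, while GzipCodec works everywhere but is slower):

    // In the hypothetical driver above, before the Job is created:
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");

If you also compress the input files themselves, keep in mind that plain gzip files are not splittable, so each .gz file is read by a single mapper; splittable codecs (for example bzip2 in recent Hadoop versions) avoid this.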

Answered by Jeff Wu


Hadoop divides up work based on the input split size. It divides your total data size by your split size, and that is how it determines how many map tasks will occur. The general consensus is that you want between 10 and 100 maps per machine; from http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html:

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.

With some input formats you can set the split size; by default most (including TextInputFormat) create one map per block. So, if you have several different files, you'll end up with more partially-filled 64 MB blocks, each of which wastes a map.

Processing one giant file is much more efficient than processing several files. The setup for the job takes longer when it has to account for multiple files. The core of Hadoop was really centered around a small number of large files. Also, HDFS is set up to handle a small number of large files, and the more files you have, the more RAM the NameNode will consume in order to keep track of them.
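
The answer does not mention it, but if you are stuck with many small input files, newer Hadoop releases also provide CombineTextInputFormat, which packs several small files (or partial blocks) into one split so that a single mapper handles many files; this reduces per-mapper overhead, though it does not reduce the NameNode memory cost of storing many files. A rough sketch against the hypothetical driver above, with an arbitrary 512 MB cap per combined split:

    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    // Pack many small text files into combined splits of at most ~512 MB each,
    // so a single mapper processes several files instead of one tiny file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);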

Answered by Brian Griffey