I have data sets on the order of hundreds of gigabytes, or even one to two terabytes. The input is therefore a list of files, each around 10 GB in size. My MapReduce job in Hadoop processes all these files and then produces only one output file (with the aggregated information).
My questions are:
What is the appropriate file size for tuning the Apache Hadoop/MapReduce framework? I hear that larger files are preferred over small ones. Any ideas? The only thing I know for sure is that Hadoop reads blocks of 64 MB each by default. So it would be good if the file size were a multiple of 64 MB.
At the moment, my application writes the output into a single file, which is then of course hundreds of gigabytes in size. I am wondering how I can partition this output efficiently. Of course I could just use some Unix tools for this job, but is it preferable to do it directly in Hadoop?
Thx for your comments!
P.S.: I am not compressing the files. The file format of the input files is text/csv.
Initially, the data for a MapReduce job is stored in input files, which typically reside in HDFS. The format of these files is arbitrary; line-based log files and binary formats can both be used.
Hadoop handles large files by splitting them into blocks of 64 MB or 128 MB (the default depends on the version). These blocks are distributed across the DataNodes, and the metadata is kept in the NameNode. When a MapReduce program runs, each block gets a mapper for execution; you cannot directly set the number of mappers.
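As a rough illustration (a sketch only; the path and sizes here are made up), the block size can also be chosen per file when it is written to HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical target path; replace with your own.
        Path out = new Path("/data/input/part-0000.csv");

        // create(path, overwrite, bufferSize, replication, blockSize)
        // Here the block size is set to 128 MB instead of the 64 MB default.
        FSDataOutputStream stream = fs.create(out, true, 4096, (short) 3, 128L * 1024 * 1024);
        stream.writeBytes("example,row,1\n");
        stream.close();
        fs.close();
    }
}
```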
a) TextInputFormat - the default InputFormat of MapReduce. It treats each line of each input file as a separate record.
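To make the record boundaries concrete, here is a minimal mapper sketch using the newer org.apache.hadoop.mapreduce API (the CSV handling is purely illustrative): with TextInputFormat the key is the byte offset of the line within the file and the value is the line itself.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, each call to map() receives one line of an input file:
//   key   = byte offset of the line within the file
//   value = the line contents
public class CsvLineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical CSV layout: the first column is some grouping key.
        String[] fields = line.toString().split(",");
        if (fields.length > 0) {
            context.write(new Text(fields[0]), ONE);
        }
    }
}
```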
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster (source: Wikipedia). MapReduce, when coupled with HDFS, can be used to handle big data.
If you're not compressing the files, then Hadoop will process your large files (say 10 GB) with a number of mappers related to the block size of the file.
Say your block size is 64 MB; then you will have ~160 mappers processing this 10 GB file (160 * 64 MB ≈ 10 GB). Depending on how CPU-intensive your mapper logic is, this might be an acceptable block size, but if you find that your mappers are executing in sub-minute times, then you might want to increase the work done by each mapper (by increasing the block size to 128, 256, or 512 MB - the actual size depends on how you intend to process the data).
A larger block size will reduce the number of mappers used to process the 10 GB file. You can of course increase the minimum split size used by TextInputFormat, but then you'll most probably run into lower data locality, as each mapper may be processing two or more blocks, which may not all reside locally on that node.
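If you do want larger splits without rewriting the files, here is a hedged sketch (assuming a Hadoop 2.x-style driver; mapred.min.split.size is the older property name, newer releases call it mapreduce.input.fileinputformat.split.minsize):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Older property name; newer releases use
        // mapreduce.input.fileinputformat.split.minsize instead.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024); // 256 MB

        Job job = Job.getInstance(conf, "split-size-demo");
        // Equivalent helper on the newer API:
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        // ... set mapper, reducer, input/output paths as usual ...
    }
}
```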
As for the output, this again depends on what your processing logic is doing - can you partition simply by introducing more reducers? This will create more output files, but what partitioning logic do you require for these files? (By default they will be hash-partitioned by your key.)
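A rough sketch of that knob (assuming the newer org.apache.hadoop.mapreduce Job API; the mapper/reducer classes are placeholders for your own logic):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "partitioned-output");
        job.setJarByClass(MultiOutputDriver.class);

        // job.setMapperClass(MyMapper.class);    // your mapper
        // job.setReducerClass(MyReducer.class);  // your reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Each reduce task writes one part-r-NNNNN file, hash-partitioned by key,
        // so 8 reducers gives 8 output files instead of a single huge one.
        job.setNumReduceTasks(8);
        // job.setPartitionerClass(MyPartitioner.class); // optional custom partitioning

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```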
Size of the input files:
One way to tune this is to look at how fast your map tasks are completing. Each map task takes one file as input, and if they are completing in under 30-40 seconds, then you should consider increasing the size of each file so that each mapper has more work to do. This is because a map task takes about 30 seconds to initialize before it does any actual work.
It also depends on how many map tasks your cluster can run at one time. You can try to tune your file and block sizes so that you take advantage of as many map tasks as possible. See this blog post for more ideas: http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
Size of output files:
The simple way to do this is to specify more than one reducer (each reducer produces a single output file). If you want to partition your results by some key (say, year-month), you can include that in the output key of your map task, and all records with that key will be sorted to the same reducer. Then you just need to check each file to see which year-month key it contains.
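A hedged sketch of that idea (the CSV layout and the date column are invented for illustration): have the mapper emit the year-month as its output key, so all records for the same month end up in the same reducer and hence the same output file.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class YearMonthMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        // Assume (hypothetically) the first column holds a date like 2011-07-15.
        if (fields.length > 0 && fields[0].length() >= 7) {
            String yearMonth = fields[0].substring(0, 7); // e.g. "2011-07"
            // All lines with the same year-month go to the same reducer.
            context.write(new Text(yearMonth), line);
        }
    }
}
```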
Compression:
I recommend that you look at compressing your files. Doing this will make the input files "bigger", since each one will contain more data for a single map task to operate on. It will also reduce the amount of disk space you use in your cluster. If anything, it might also increase the performance of MapReduce on your cluster, because less disk I/O and network traffic will occur from reading and moving the files around.
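If you go that route, one related, hedged sketch (using the standard FileOutputFormat helpers) is to have your jobs write compressed output in the first place, so the files that feed later jobs are already compressed:

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputConfig {
    public static void configure(Job job) {
        // Write the job's output files compressed, so downstream jobs read
        // smaller files and less data moves across disks and the network.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}
```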
Also, compress the intermediate output of your map tasks (the output from the map task before it goes to the reducer). This will increase performance in a similar manner, and is done by setting mapred.compress.map.output=true.
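A minimal sketch of that setting, using the older property name quoted above (newer Hadoop releases renamed it to mapreduce.map.output.compress):

```java
import org.apache.hadoop.conf.Configuration;

public class MapOutputCompression {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Compress the intermediate map output before it is shuffled to the reducers.
        conf.setBoolean("mapred.compress.map.output", true);
        // Optionally pick a codec as well, e.g. via the
        // "mapred.map.output.compression.codec" property (name varies by version).
        return conf;
    }
}
```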
Hadoop divides up work based on the input split size. It divides your total data size by your split size, and that's how it determines how many map tasks will run. The general consensus is that you want between 10 and 100 maps per machine; from http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html:
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
With some input formats you can set the split size; by default most (including TextInputFormat) create one map per block. So, if you have several different files, you'll end up with more incomplete 64 MB blocks, which is a waste of a map.
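For reference, this is roughly how FileInputFormat picks the split size (a simplified sketch of the logic, not the exact source):

```java
public class SplitSizeFormula {
    // Simplified view of how FileInputFormat picks the split size:
    //   minSize   - mapred.min.split.size (default 1)
    //   maxSize   - mapred.max.split.size (default Long.MAX_VALUE)
    //   blockSize - HDFS block size of the file (64 MB by default here)
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // With the defaults, the split size is simply the block size,
        // i.e. one map task per block.
        System.out.println(computeSplitSize(64L << 20, 1L, Long.MAX_VALUE)); // 67108864
    }
}
```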
Processing one giant file is much more efficient than processing several files. The setup for the job takes longer when it has to account for multiple files. The core of Hadoop was really built around small numbers of large files. Also, HDFS is set up to handle small numbers of large files, and the more files you have, the more RAM the NameNode will consume to keep track of them.