Best splittable compression for Hadoop input = bz2?

Tags: gzip, hadoop, bzip2, hdfs

We've realized a bit too late that archiving our files in GZip format for Hadoop processing isn't such a great idea. GZip isn't splittable, and for reference, here are the problems which I won't repeat:

Very basic question about Hadoop and compressed input files
Hadoop gzip compressed files
Hadoop gzip input file using only one mapper
Why can't hadoop split up a large text file and then compress the splits using gzip?

My question is: is BZip2 the best archival compression that will allow a single archive file to be processed in parallel by Hadoop? Gzip is definitely not, and from my reading LZO has some problems.


asked Feb 11 '13 20:02

Suman

2 Answers

I don't think the other answer is correct: according to http://comphadoop.weebly.com/, bzip2 is splittable. LZO is too, if indexed.
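As background on why bzip2 can be split at all: each compressed block in a bzip2 stream starts with the same 48-bit marker (0x314159265359), so a reader can scan forward from an arbitrary position to the next block boundary and start decoding there. As far as I know this is what Hadoop's splittable bzip2 support relies on. The following is a minimal Python sketch, not Hadoop code; the sample data and the helper names `count_block_markers` and `sample_text` are made up for illustration.

```python
import bz2
import random

# 48-bit marker that opens every bzip2 block (the BCD digits of pi).
BLOCK_MAGIC = bytes.fromhex("314159265359")


def count_block_markers(compressed: bytes) -> int:
    """Count block-start markers at any bit offset in a bzip2 stream."""
    as_int = int.from_bytes(compressed, "big")
    total = 0
    # Blocks are bit-aligned, so try all eight bit shifts and search bytewise.
    for shift in range(8):
        shifted = (as_int >> shift).to_bytes(len(compressed), "big")
        total += shifted.count(BLOCK_MAGIC)
    return total


def sample_text(num_lines: int, seed: int = 0) -> bytes:
    """Build deterministic, log-like sample data (purely illustrative)."""
    rng = random.Random(seed)
    lines = (
        f"{i},{rng.randrange(10**9)},user{rng.randrange(5000)}\n"
        for i in range(num_lines)
    )
    return "".join(lines).encode("ascii")


raw = sample_text(40_000)
# compresslevel=1 means roughly 100 kB of input per block, so this
# sample spans several blocks.
compressed = bz2.compress(raw, compresslevel=1)
markers = count_block_markers(compressed)
print(f"{len(raw)} bytes in, {len(compressed)} bytes out, {markers} block markers found")
```

Gzip's DEFLATE stream has no comparable marker, which is why a reader cannot safely pick up in the middle of a .gz file.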

So the answer is yes: if you want to use more mappers than you have files, you'll want to use bzip2.

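To do this, you could write a simple MR job that reads the data and writes it out again. You then need to set mapred.output.compression.codec to org.apache.hadoop.io.compress.BZip2Codec and enable output compression with mapred.output.compress=true. (On Hadoop 2 and later these properties are named mapreduce.output.fileoutputformat.compress.codec and mapreduce.output.fileoutputformat.compress.)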
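If the archives are small enough to handle outside the cluster, the same read-then-rewrite idea can be sketched locally. This is a swapped-in technique, not the MR job described above: it uses only Python's standard gzip and bz2 modules, and the function name `recompress_gzip_to_bzip2` and the file names are hypothetical.

```python
import bz2
import gzip
import os
import shutil
import tempfile


def recompress_gzip_to_bzip2(src_path: str, dst_path: str, chunk_size: int = 1 << 20) -> None:
    """Stream-decompress a .gz file and write the same bytes as .bz2."""
    with gzip.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        # Copy in fixed-size chunks so large archives never sit fully in memory.
        shutil.copyfileobj(src, dst, chunk_size)


# Small self-check using throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "part-00000.gz")   # hypothetical file name
    bz_path = os.path.join(tmp, "part-00000.bz2")  # hypothetical file name
    payload = b"".join(b"line %d\n" % i for i in range(10_000))
    with gzip.open(gz_path, "wb") as f:
        f.write(payload)
    recompress_gzip_to_bzip2(gz_path, bz_path)
    with bz2.open(bz_path, "rb") as f:
        round_trip = f.read()
    print(round_trip == payload)
```

For data that already lives in HDFS, the MR job is likely the better fit, since the recompression itself then runs across the cluster (though each gzip input file is still read by a single mapper).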

answered Sep 28 '22 22:09

samthebest

BZIP2 is splittable in Hadoop. It provides a very good compression ratio, but it is not optimal in terms of CPU time and performance, because compression is very CPU-intensive.
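A quick way to get a feel for this ratio-versus-CPU trade-off on your own data is to time both codecs on a sample. The sketch below uses Python's standard gzip and bz2 modules on made-up CSV-like data; the `measure` helper is hypothetical, and the numbers depend heavily on the data and the machine, so treat them as indicative only.

```python
import bz2
import gzip
import random
import time


def measure(name, compress, data):
    """Return (name, compressed size in bytes, seconds spent compressing)."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(out), elapsed


rng = random.Random(42)
# Illustrative sample: 50,000 CSV-like lines; real archives will behave differently.
data = "".join(
    f"{i},{rng.choice(['GET', 'POST', 'PUT'])},/item/{rng.randrange(1000)},{rng.randrange(500)}\n"
    for i in range(50_000)
).encode("ascii")

results = [
    measure("gzip", gzip.compress, data),
    measure("bzip2", bz2.compress, data),
]
for name, size, seconds in results:
    print(f"{name:6s} ratio={len(data) / size:5.2f} time={seconds * 1000:7.1f} ms")
```

Running it against a representative slice of the real archive, rather than synthetic lines, gives a more useful comparison.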

LZO is splittable in Hadoop: with hadoop-lzo you get splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel. The library provides the means to generate these indexes either locally or in a distributed manner.

LZ4 is splittable in Hadoop: with hadoop-4mc you get splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any speed/compression-ratio level: from a fast mode reaching 500 MB/s compression speed up to high/ultra modes providing a higher compression ratio, almost comparable to GZIP's.

answered Sep 28 '22 21:09

Carlo Medas
