Why can't hadoop split up a large text file and then compress the splits using gzip?

Tags: compression, gzip, hadoop, hdfs

I've recently been looking into Hadoop and HDFS. When you load a file into HDFS, it will normally split the file into 64MB chunks and distribute these chunks around your cluster, except that it can't do this with gzip'd files because a gzip'd file can't be split. I completely understand why this is the case (I don't need anyone explaining why a gzip'd file can't be split up). But why couldn't HDFS take a plain text file as input, split it like normal, and then compress each split separately using gzip? When any split is accessed, it's just decompressed on the fly.

In my scenario, each split is compressed completely independently. There are no dependencies between splits, so you don't need the entire original file to decompress any one of the splits. That is the approach this patch takes: https://issues.apache.org/jira/browse/HADOOP-7076. Note that this is not what I'd want.
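A minimal sketch of the scheme described above, assuming Python's standard `gzip` module stands in for whatever codec the file system would use. The names `compress_chunks` and `read_chunk` are hypothetical and are not part of Hadoop or HDFS. It only illustrates that each independently compressed chunk can be read without the others.

```python
import gzip
from typing import List


def compress_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut data into fixed-size chunks and gzip each chunk on its own."""
    return [
        gzip.compress(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def read_chunk(chunks: List[bytes], index: int) -> bytes:
    """Decompress a single chunk without touching any of the others."""
    return gzip.decompress(chunks[index])
```

Calling `read_chunk(chunks, 2)` returns only the bytes of the third chunk. Because a gzip stream may consist of several concatenated members, joining all the compressed chunks back together also yields a valid gzip stream that decompresses to the original data. Note that cutting at fixed byte offsets can split a line in two, so a reader would still have to handle records that straddle a chunk boundary.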

This seems pretty basic... what am I missing? Why couldn't this be done? Or if it could be done, why have the Hadoop developers not gone down this route? It seems strange given how much discussion I've found regarding people wanting split gzip'd files in HDFS.


asked Jun 28 '11 18:06

onlynone

2 Answers

The simple reason is the design principle of "separation of concerns".

If you do what you propose, HDFS must know what the actual bits and bytes of the file mean. HDFS must also be able to act on that knowledge (extract, decompress, and so on). In general you don't want this kind of mixing of responsibilities in software.

So the only part that has to understand what the bits mean is the application that reads them, which is commonly written using the MapReduce part of Hadoop.

As stated in the Javadoc of HADOOP-7076 (I wrote that thing ;) ):

Always remember that there are alternative approaches:

Decompress the original gzipped file, split it into pieces and recompress the pieces before offering them to Hadoop.
For example: Splitting gzipped logfiles without storing the ungzipped splits on disk

Decompress the original gzipped file and compress using a different splittable codec. For example BZip2Codec or not compressing at all.
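The two alternatives above can be sketched roughly as follows, assuming Python's standard `gzip` and `bz2` modules and line-oriented input. The function names, the `.partNNNNN.gz` naming scheme, and the `max_bytes` parameter are hypothetical choices for illustration; this is not the tool referenced in the Javadoc.

```python
import bz2
import gzip
import shutil
from pathlib import Path
from typing import List


def resplit_gzip(src: Path, out_dir: Path, max_bytes: int) -> List[Path]:
    """Stream a gzip file into several smaller gzip files, cutting at line ends.

    Only compressed data is written to disk; the uncompressed text is held
    in memory one line at a time. max_bytes limits the uncompressed size of
    each piece (a single longer line still gets a piece of its own).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    pieces: List[Path] = []
    out = None
    written = 0
    with gzip.open(src, "rb") as reader:
        for line in reader:
            # Open a new piece at the start, or when this line would overflow.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                piece = out_dir / f"{src.stem}.part{len(pieces):05d}.gz"
                out = gzip.open(piece, "wb")
                pieces.append(piece)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return pieces


def gzip_to_bzip2(src: Path, dst: Path) -> None:
    """Recompress a gzip file as bzip2 by streaming, with no uncompressed copy on disk."""
    with gzip.open(src, "rb") as reader, bz2.open(dst, "wb") as writer:
        shutil.copyfileobj(reader, writer)
```

Each piece produced by `resplit_gzip` is an ordinary gzip file that can be handed to Hadoop as a separate input file, while `gzip_to_bzip2` produces a single file in a codec that Hadoop can split.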

HTH


answered Nov 03 '22 00:11

Niels Basjes

HDFS has a deliberately limited scope: it is only a distributed file-system service and does not do heavy-lifting operations such as compressing the data. The actual work of compressing data is delegated to distributed execution frameworks such as MapReduce, Spark and Tez. Compression of data and files is therefore the concern of the execution framework, not of the file system.

Additionally, container file formats such as SequenceFile and Parquet remove the need for HDFS to compress data blocks automatically in the way the question suggests.
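To illustrate why such container formats help, here is a toy sketch of a file whose blocks are compressed independently, so a reader can start at any block. It assumes Python's standard `zlib` and `struct` modules, and the layout (a 4-byte length prefix per block plus a separate offset list) is invented for this example. It is not the on-disk format of SequenceFile or Parquet.

```python
import io
import struct
import zlib
from typing import List, Tuple


def write_container(records: List[bytes], records_per_block: int) -> Tuple[bytes, List[int]]:
    """Pack records into independently compressed blocks.

    Returns the container bytes and the byte offset at which each block starts.
    """
    buf = io.BytesIO()
    offsets: List[int] = []
    for start in range(0, len(records), records_per_block):
        block = b"".join(records[start:start + records_per_block])
        payload = zlib.compress(block)
        offsets.append(buf.tell())
        buf.write(struct.pack(">I", len(payload)))  # 4-byte big-endian length
        buf.write(payload)
    return buf.getvalue(), offsets


def read_block(container: bytes, offset: int) -> bytes:
    """Decompress the single block that starts at the given offset."""
    (length,) = struct.unpack_from(">I", container, offset)
    return zlib.decompress(container[offset + 4:offset + 4 + length])
```

A reader that is handed one offset can decompress just that block, which is the property that lets a processing framework work on different parts of one compressed file in parallel.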

To summarize: as a matter of design philosophy, any compression of data must be done by the execution engine, not by the file system service.

answered Nov 03 '22 00:11

rogue-one
