As can be derived from the question, I want to know when it makes sense to have input files in a compressed format (like gzip) and when it makes sense to have them uncompressed.
What is the overhead of having compressed files? Is it much slower when reading the file? Are there any benchmarks done on big input files?
Thx!
It mostly makes sense to have input files in a compressed format, unless you are doing development and frequently need to copy data from HDFS to the local file system to work on it.
A compressed format provides a significant storage advantage. Data in a Hadoop cluster is replicated unless you configure it otherwise. Replication gives good redundancy but consumes more space: with a replication factor of 3, your data takes up 3 times its own size in cluster capacity.
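To make the arithmetic concrete, here is a minimal sketch of how replication and compression combine. The function name, the 1 TB dataset and the 4x compression ratio are assumptions chosen for illustration, not measurements from a real cluster.

```python
def raw_storage_bytes(logical_bytes, replication=3, compression_ratio=1.0):
    """Estimate the bytes consumed across the cluster.

    compression_ratio is uncompressed size / compressed size,
    so 1.0 means the data is stored uncompressed.
    """
    if replication < 1 or compression_ratio <= 0:
        raise ValueError("replication must be >= 1 and compression_ratio > 0")
    return logical_bytes * replication / compression_ratio


TB = 10**12

# 1 TB of logs, replication factor 3, no compression.
uncompressed_total = raw_storage_bytes(1 * TB, replication=3)

# Same data with an assumed (hypothetical) 4x compression ratio.
compressed_total = raw_storage_bytes(1 * TB, replication=3, compression_ratio=4.0)

print(f"uncompressed: {uncompressed_total / TB:.2f} TB on disk")
print(f"compressed:   {compressed_total / TB:.2f} TB on disk")
```

The point is only that the replication factor multiplies whatever you store, so any saving from compression is multiplied by it as well.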
Compression of textual data such as logs is very effective because it yields a high compression ratio. This is also the kind of data you most often find in a Hadoop cluster.
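If you want a rough feel for this on your own data, a small sketch like the following measures the gzip ratio with Python's standard library. The log lines here are synthetic and made up for the example, so they are more repetitive than real logs; run it on a sample of your actual files to get a meaningful number.

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Return uncompressed size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


# Synthetic, hypothetical log lines; real logs will compress differently.
lines = [
    f"2024-01-01 12:00:{i % 60:02d} INFO request id={i} status=200 path=/index.html\n"
    for i in range(10000)
]
log_data = "".join(lines).encode("utf-8")

ratio = compression_ratio(log_data)
print(f"{len(log_data)} bytes uncompressed, gzip ratio about {ratio:.1f}x")
```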
I don't have benchmarks, but I have not seen any significant penalty on a decent-sized cluster with the data that we have.
However, for the time being, choose LZO over gzip.
See: LZO compression and its significance over gzip
Gzip compresses better than LZO, but LZO is faster at compressing and decompressing. LZO files can also be split, whereas splittable gzip is not available, although I have seen Jira tasks for it (and for bzip2 as well). Splitting matters because a file that cannot be split has to be read by a single map task.
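A rough sketch of why splittability matters for parallelism: it assumes a hypothetical 128 MB block size and one map task per block-sized split, which simplifies how Hadoop actually computes splits. The function name is made up for the example.

```python
import math


def estimated_map_tasks(file_bytes, block_bytes=128 * 1024**2, splittable=True):
    """Estimate how many map tasks can read one input file in parallel.

    Assumption: one split per block for splittable formats, and a single
    split for formats that cannot be split (such as a plain gzip file).
    """
    if file_bytes <= 0:
        return 0
    if not splittable:
        return 1
    return math.ceil(file_bytes / block_bytes)


one_gib = 1024**3

print("splittable file:  ", estimated_map_tasks(one_gib, splittable=True), "map tasks")
print("unsplittable file:", estimated_map_tasks(one_gib, splittable=False), "map task")
```

Under these assumptions, a large unsplittable file is read by a single task no matter how many nodes are available. Many small gzip files avoid this, because each file gets its own task.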