Spark SQL - difference between gzip vs snappy vs lzo compression formats

I am trying to use Spark SQL to write Parquet files.

By default Spark SQL supports gzip, but it also supports other compression formats like snappy and lzo.

What is the difference between these compression formats?

Shankar asked Mar 04 '16 06:03



1 Answer

Compression Ratio : GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio.

General Usage : GZip is often a good choice for cold data, which is accessed infrequently. Snappy or LZO are a better choice for hot data, which is accessed frequently.

Snappy often performs better than LZO. It is worth running tests to see if you detect a significant difference.
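A minimal sketch of such a test, assuming you only want a rough feel for the ratio-versus-speed trade-off. Snappy and LZO are not in the Python standard library, so this uses zlib (DEFLATE, the algorithm behind gzip) at a fast and a strong level, plus bz2, as stand-ins; the `benchmark` helper and the sample data are hypothetical, and you would swap in the real codecs and your own data for a meaningful result.

```python
import bz2
import time
import zlib


def benchmark(data: bytes) -> dict:
    """Return compression ratio and timings for a few stdlib codecs."""
    codecs = {
        # zlib level 1: fast, lower ratio (stand-in for a speed-oriented codec)
        "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
        # zlib level 9: DEFLATE at its strongest setting (gzip-like behaviour)
        "zlib-9": (lambda b: zlib.compress(b, 9), zlib.decompress),
        # bz2: slower, usually a higher ratio than DEFLATE
        "bz2": (bz2.compress, bz2.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        restored = decompress(packed)
        decompress_s = time.perf_counter() - start

        if restored != data:
            raise RuntimeError(f"{name} did not round-trip the input")

        results[name] = {
            "ratio": len(data) / len(packed),
            "compress_s": compress_s,
            "decompress_s": decompress_s,
        }
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive CSV-like rows. Use your own data instead.
    sample = b"user_id,event,timestamp\n" + b"42,click,2016-03-04T06:03:00\n" * 50_000
    for name, stats in benchmark(sample).items():
        print(
            f"{name:7s} ratio={stats['ratio']:.1f}x "
            f"compress={stats['compress_s'] * 1000:.1f}ms "
            f"decompress={stats['decompress_s'] * 1000:.1f}ms"
        )
```

Timings from a single run on synthetic data are noisy, so repeat the measurement on a representative slice of your own tables before drawing conclusions.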

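Splittability : If you need your compressed files to be splittable, BZip2 is splittable and LZO is splittable once it has been indexed, but plain GZip and Snappy files are not. Inside a container format such as Parquet, compression is applied to blocks within the file, so the file stays splittable whichever codec you choose.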

GZIP compresses data about 30% more than Snappy, but reading GZIP data takes about 2x more CPU than reading Snappy data.

LZO focuses on decompression speed at low CPU usage, and offers higher compression levels at the cost of more CPU.

For long-term or static storage, GZip is still the better choice.
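As a sketch of how the codec is chosen when writing Parquet from PySpark, assuming a Spark version that accepts the `compression` writer option (2.0 or later). The helper names `parquet_writer_options` and `write_parquet` are hypothetical, the codec list covers only the formats discussed here, and LZO additionally needs the codec to be available on the cluster.

```python
# Codec names discussed in this answer. Newer Spark releases accept more
# values; this set is an assumption made for the example.
PARQUET_CODECS = {"uncompressed", "snappy", "gzip", "lzo"}


def parquet_writer_options(codec: str) -> dict:
    """Build the writer options for a given Parquet compression codec."""
    codec = codec.lower()
    if codec not in PARQUET_CODECS:
        raise ValueError(f"unsupported Parquet codec: {codec!r}")
    return {"compression": codec}


def write_parquet(df, path: str, codec: str) -> None:
    """Write a DataFrame as Parquet with the chosen codec.

    `df` is assumed to be a pyspark.sql.DataFrame. A session-wide default
    can be set instead with
    spark.conf.set("spark.sql.parquet.compression.codec", codec).
    """
    df.write.options(**parquet_writer_options(codec)).mode("overwrite").parquet(path)
```

For example, `write_parquet(df, "/tmp/events_cold", "gzip")` would suit rarely read archives, while `"snappy"` would suit tables that are queried often; the path here is only a placeholder.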

See extensive research and benchmark code and results in this article (Performance of various general compression algorithms – some of them are unbelievably fast!).

Ram Ghadiyaram answered Sep 19 '22 04:09