I was writing data to Hadoop and Hive in Parquet format using Spark. I want to enable compression, but I can only find two compression types - Snappy and Gzip - being used most of the time. Does Parquet also support other compressions such as Deflate and LZO?
The supported compression types for Apache Parquet are specified in the parquet-format repository:
/**
 * Supported compression algorithms.
 *
 * Codecs added in 2.4 can be read by readers based on 2.4 and later.
 * Codec support may vary between readers based on the format version and
 * libraries available at runtime. Gzip, Snappy, and LZ4 codecs are
 * widely available, while Zstd and Brotli require additional libraries.
 */
enum CompressionCodec {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
  BROTLI = 4; // Added in 2.4
  LZ4 = 5;    // Added in 2.4
  ZSTD = 6;   // Added in 2.4
}
https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L461
Snappy and Gzip are the most commonly used ones and are supported by all implementations. LZ4 and ZSTD yield better compression results than the former two, but they are a rather new addition to the format, so they are only supported in newer versions of some of the implementations.
From the Spark source code, branch 2.1:
You can set the following Parquet-specific option(s) for writing Parquet files:
compression (default is the value specified in spark.sql.parquet.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, gzip, and lzo). This will override spark.sql.parquet.compression.codec.
...
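For example, here is a minimal sketch of passing the codec on a single write (the DataFrame, app name, and output path are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-write-example").getOrCreate()
val df = spark.range(1000).toDF("id") // toy data, just for the example

// The per-write "compression" option overrides spark.sql.parquet.compression.codec
df.write.option("compression", "gzip").parquet("/tmp/parquet_gzip")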
Overall, the supported compressions are: none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd.
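If you want one of the newer codecs as the default for all Parquet writes, you can set the session config instead. A sketch, assuming Spark 2.4+ and that the zstd libraries are available at runtime (the app name and output path are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-zstd-example")
  .config("spark.sql.parquet.compression.codec", "zstd") // default codec for Parquet writes
  .getOrCreate()

spark.range(1000).toDF("id").write.parquet("/tmp/parquet_zstd")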