Context:
Left:
column_a: INT64 SNAPPY DO:0 FPO:4 SZ:5179987/6161135/1.19 VC:770100 ENC:PLAIN,RLE,BIT_PACKED
Right:
column_a: INT64 SNAPPY DO:0 FPO:4 SZ:3040269/5671813/1.87 VC:782499 ENC:BIT_PACKED,PLAIN,RLE,PLAIN_DICTIONARY
My Question:
How does Parquet determine which encoding to use, and what could have made it choose a different encoding? Is this something we can control with a Hive / Spark config?
I think the mailing list message here, with the reply here, has the best answer that I'm aware of. In short, you can't directly control the encoding that Parquet uses for a given column. Some settings may help a bit, such as writing a Parquet Version 2 file rather than a Parquet Version 1 file, but that isn't precise control. More may be possible, but it would probably involve diving deep into the internals of a Parquet implementation.
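To give some intuition for why the two files above could end up with different encodings, here is a minimal sketch in plain Python. It is a toy model, not parquet-mr's actual code. It assumes the writer tries dictionary encoding first and falls back to PLAIN once the dictionary grows past a byte budget. The function name `choose_encoding`, its parameters, and the 1 MiB default are my own illustrative choices. As far as I know, the related parquet-mr writer properties are `parquet.enable.dictionary`, `parquet.dictionary.page.size` and `parquet.writer.version`, but whether your Hive or Spark version passes them through is something you would need to verify.

```python
# Toy model only: NOT parquet-mr's real implementation.
# Assumption: the writer tries dictionary encoding first and falls back to
# PLAIN once the dictionary would exceed a byte budget (assumed 1 MiB here).

INT64_BYTES = 8  # size of one INT64 dictionary entry


def choose_encoding(values, dictionary_enabled=True, max_dict_bytes=1024 * 1024):
    """Return the value encoding this toy writer would end up using."""
    if not dictionary_enabled:
        return "PLAIN"
    dictionary = set()
    for v in values:
        dictionary.add(v)
        # Fall back as soon as the dictionary no longer fits the budget.
        if len(dictionary) * INT64_BYTES > max_dict_bytes:
            return "PLAIN"
    return "PLAIN_DICTIONARY"


# Synthetic data, not taken from the files in the question.
low_cardinality = [i % 1000 for i in range(200_000)]  # 1,000 distinct values
high_cardinality = list(range(200_000))               # 200,000 distinct values

print(choose_encoding(low_cardinality))
print(choose_encoding(high_cardinality))
```

Under these assumptions, a column with few distinct values stays dictionary-encoded, while a column with many distinct values (or a writer with dictionary encoding switched off) ends up PLAIN. That would be consistent with only the right-hand file listing PLAIN_DICTIONARY, though the dump alone doesn't tell us which of those causes applied to the left-hand file.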