 

How does Parquet determine which encoding to use?

Context:

  • I have two identical datasets (say, Left and Right), each containing 112 Parquet files.
  • These Parquet files were created using Hive by converting delimited flat files.
  • The process used to create the delimited flat files is slightly different between the Left and Right pipelines.
  • I noticed a significant size difference between the two datasets, even though their content is exactly the same: the Left dataset is 400 GB and the Right dataset is 420 GB.
  • When I checked the Parquet metadata using parquet-tools, I noticed that different encodings were used for the Left and Right datasets, as shown below for one column (SZ reports compressed size/uncompressed size/compression ratio, VC is the value count, and ENC lists the encodings used):

Left:

column_a:  INT64 SNAPPY DO:0 FPO:4 SZ:5179987/6161135/1.19 VC:770100 ENC:PLAIN,RLE,BIT_PACKED

Right:

column_a:  INT64 SNAPPY DO:0 FPO:4 SZ:3040269/5671813/1.87 VC:782499 ENC:BIT_PACKED,PLAIN,RLE,PLAIN_DICTIONARY

My Question:

How does Parquet determine which encoding to use, and what could have made it choose different encodings for the two datasets? Is this something we can control using a Hive or Spark config?

asked Oct 15 '22 by Baahubali


1 Answer

I think the mailing list message here, with the reply here, has the best answer I'm aware of. In short, you can't directly control the encoding that Parquet uses for any given column. There are some things that may help improve it a bit, like specifying that you want to write a Parquet version 2 file rather than a version 1 file, but that's not precise control. There may be more that can be done, but it would probably involve diving deep into the internals of a Parquet implementation.
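To make the "version 2" suggestion concrete, here is a minimal PySpark sketch. The Hadoop configuration keys (parquet.writer.version, parquet.enable.dictionary, parquet.dictionary.page.size) come from the parquet-mr writer library; the DataFrame and output path below are hypothetical, and these settings only influence which encodings Parquet considers, they don't force a particular one per column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-encoding-hints").getOrCreate()

# parquet-mr (Java Parquet) Hadoop configuration keys; they influence, but do
# not dictate, the encodings Parquet ends up choosing for each column.
# _jsc is PySpark's internal JavaSparkContext handle, commonly used to reach
# the underlying Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("parquet.writer.version", "v2")       # write Parquet format version 2
hadoop_conf.set("parquet.enable.dictionary", "true")  # allow dictionary encoding
hadoop_conf.set("parquet.dictionary.page.size", str(1024 * 1024))  # dictionary budget per column chunk

# Hypothetical data and output path, just to show where the settings take
# effect; spark.range produces an INT64 column, like column_a in the question.
df = spark.range(1000000).withColumnRenamed("id", "column_a")
df.write.mode("overwrite").parquet("/tmp/encoded_output")

In Hive, the closest equivalent should be setting the same properties before the INSERT, e.g. SET parquet.enable.dictionary=true;. Even then, the writer decides per column chunk whether, say, dictionary encoding pays off (it can fall back to PLAIN if the dictionary grows too large), which is likely why your two pipelines ended up with different ENC lists despite identical content.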

answered Oct 31 '22 by bnsmith