 

How does Parquet determine which encoding to use?

Context:

  • I have two identical datasets (say, Left and Right), each containing 112 Parquet files.
  • These Parquet files were created using Hive by converting delimited flat files.
  • The process used to create the delimited flat files is slightly different between the Left and Right pipelines.
  • I noticed a significant size difference between the two datasets, even though their content is exactly the same: the Left dataset is 400 GB and the Right dataset is 420 GB.
  • When I checked the Parquet metadata using parquet-tools, I noticed that different encodings were used for the Left and Right datasets, as shown below for one column (SZ reports compressed size/uncompressed size/compression ratio, VC is the value count, and ENC lists the encodings used):

Left:

column_a:  INT64 SNAPPY DO:0 FPO:4 SZ:5179987/6161135/1.19 VC:770100 ENC:PLAIN,RLE,BIT_PACKED

Right:

column_a:  INT64 SNAPPY DO:0 FPO:4 SZ:3040269/5671813/1.87 VC:782499 ENC:BIT_PACKED,PLAIN,RLE,PLAIN_DICTIONARY

My Question:

How does Parquet determine which encoding to use, and what could have made it choose different encodings for the two datasets? Is this something we can control using a Hive or Spark config?

asked Oct 15 '22 by Baahubali


1 Answer

I think the mailing list message here, with the reply here, has the best answer I'm aware of. In short, you can't directly control the encoding that Parquet uses for any given column. There are some things that may help improve it a bit, like specifying that you want to write a Parquet version 2 file rather than a version 1 file, but that's not precise control. There may be more that can be done, but it would probably involve diving deep into the internals of a Parquet implementation.
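To make the "version 2" suggestion concrete, here is a minimal PySpark sketch. The Hadoop configuration keys (parquet.writer.version, parquet.enable.dictionary, parquet.dictionary.page.size) come from the parquet-mr writer library; the DataFrame and output path below are hypothetical, and these settings only influence which encodings Parquet considers, they don't force a particular one per column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-encoding-hints").getOrCreate()

# parquet-mr (Java Parquet) Hadoop configuration keys; they influence, but do
# not dictate, the encodings Parquet ends up choosing for each column.
# _jsc is PySpark's internal JavaSparkContext handle, commonly used to reach
# the underlying Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("parquet.writer.version", "v2")       # write Parquet format version 2
hadoop_conf.set("parquet.enable.dictionary", "true")  # allow dictionary encoding
hadoop_conf.set("parquet.dictionary.page.size", str(1024 * 1024))  # dictionary budget per column chunk

# Hypothetical data and output path, just to show where the settings take
# effect; spark.range produces an INT64 column, like column_a in the question.
df = spark.range(1000000).withColumnRenamed("id", "column_a")
df.write.mode("overwrite").parquet("/tmp/encoded_output")

In Hive, the closest equivalent should be setting the same properties before the INSERT, e.g. SET parquet.enable.dictionary=true;. Even then, the writer decides per column chunk whether, say, dictionary encoding pays off (it can fall back to PLAIN if the dictionary grows too large), which is likely why your two pipelines ended up with different ENC lists despite identical content.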

answered Oct 31 '22 by bnsmith