 

Enum equivalent in Spark Dataframe/Parquet

I have a table with hundreds of millions of rows that I want to store in a DataFrame in Spark and persist to disk as a Parquet file.

The size of my Parquet file(s) is now in excess of 2TB and I want to make sure I have optimized this.

A large proportion of these columns are string values that can be lengthy but often have very few distinct values. For example, I have a column with only two distinct values (a 20-character and a 30-character string), and another column whose strings average around 400 characters but have only about 400 distinct values across all entries.

In a relational database I would usually normalize those values out into a different table with references, or at least define my table with some sort of enum type.
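For example, the kind of manual normalization I have in mind would look roughly like this in Spark (the column names are made up):

import org.apache.spark.sql.functions.monotonically_increasing_id

// Build a small lookup table of the few distinct values and replace the long strings with ids.
val lookup = df.select("long_status").distinct()
  .withColumn("status_id", monotonically_increasing_id())
val normalized = df.join(lookup, Seq("long_status")).drop("long_status")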

I cannot see anything that matches that pattern in Spark DataFrames or Parquet files. Does the columnar storage already handle this efficiently, or should I look into something to optimize it further?

asked Jun 23 '17 by Hans


1 Answer

Parquet doesn't have a mechanism for automatically generating enum-like types, but you can use the page dictionary. The page dictionary stores the list of distinct values in each Parquet page so that rows simply reference dictionary entries instead of repeating the full strings. For the column described in the question (about 400 distinct values of roughly 400 characters each), the dictionary would hold at most around 400 × 400 bytes ≈ 160 KB, while each row stores only a small index into it. To enable the dictionary for the Parquet writer in Spark:

spark.conf.set("parquet.dictionary.enabled", "true")
spark.conf.set("parquet.dictionary.page.size", 2 * 1024 * 1024)

Note that you have to write the file with these options enabled, or the dictionary won't be used.
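As a minimal sketch (assuming a Scala SparkSession named spark, an existing DataFrame df, and a placeholder output path; passing these Hadoop properties as per-write options is an assumption, not something the source guide spells out), the same settings can also be scoped to a single write:

// Hypothetical example: df and the output path are placeholders.
df.write
  .option("parquet.enable.dictionary", "true")
  .option("parquet.dictionary.page.size", (2 * 1024 * 1024).toString)
  .parquet("/path/to/output")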

To also use the dictionary for filtering on read (skipping data whose dictionary shows the filtered value cannot be present), you can set:

spark.conf.set("parquet.filter.dictionary.enabled", "true")

Source: Parquet performance tuning: The missing guide

answered Oct 06 '22 by Steven Sheffey