 

Spark 2.3+ use of parquet.enable.dictionary?

I am looking for documentation on how parquet.enable.dictionary is to be used in Spark (latest 2.3.1). It can be set to "true" or "false" when creating a SparkSession.

I googled for any documentation on this feature and found nothing, or at least nothing recent.

Specifically these are my questions:

Is parquet.enable.dictionary = true or = false by default in Spark 2.3.1?

Is this a feature to enable (set to true) before I write to Parquet files, so that the Parquet library used by Spark computes and writes the dictionary information to disk?

Is this setting ignored when Spark reads the Parquet files or do I still need to set it to true for reading parquet (as well as writing)?

When should I use this feature (set it to true)? Pros/cons?

I also see references to spark.hadoop.parquet.enable.dictionary when I googled for parquet.enable.dictionary. Is this related? Which should I use?

Are there any other Spark + Parquet settings I need to be aware of?

Many thanks!

Asked by Acid Rider on Sep 14 '18.


People also ask

What is Spark SQL parquet?

Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
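
As a minimal sketch of that round trip (the path, app name, and column names here are made up for illustration), writing a DataFrame to Parquet and reading it back preserves the schema automatically:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.write.mode("overwrite").parquet("/tmp/demo_parquet")   // schema is stored inside the Parquet files

val readBack = spark.read.parquet("/tmp/demo_parquet")    // schema is recovered without being specified
readBack.printSchema()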

How to compress parquet data in spark?

Parquet also allows you to compress data pages (snappy, gzip, lzo). The compression codec can be set via a Spark configuration. One key thing to remember is that compressed data has to be uncompressed when you read it during your processing. When Spark runs a query over Parquet, it first reads the footer of the Parquet file, where the metadata is stored.
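
For example (assuming the SparkSession spark and DataFrame df from the first snippet above, with illustrative output paths), the codec can be set session-wide or per write:

// Session-wide default codec for Parquet output
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
df.write.mode("overwrite").parquet("/tmp/demo_parquet_gzip")

// Per-write codec via the DataFrameWriter option
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/demo_parquet_snappy")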

How do I create a parquet schema in spark?

Configuration of Parquet can be done using the setConf method on SparkSession or by running SET key=value commands using SQL. Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema.
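
Both configuration styles look roughly like this (assuming an existing SparkSession named spark, as in spark-shell; binaryAsString is used purely as an example key):

// Programmatically, on the session configuration
spark.conf.set("spark.sql.parquet.binaryAsString", "true")

// Or with a SQL command
spark.sql("SET spark.sql.parquet.binaryAsString=true")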

How does the parquet data source work?

When spark.sql.parquet.mergeSchema is true, the Parquet data source merges schemas collected from all data files; otherwise the schema is picked from the summary file, or from a random data file if no summary file is available. When spark.sql.parquet.writeLegacyFormat is true, data is written in the format of Spark 1.4 and earlier.
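
A small sketch of both behaviours (again assuming an existing SparkSession spark and an illustrative path):

// Merge schemas from all part files for this read only
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/demo_parquet")

// Or turn schema merging on for the whole session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

// Legacy (Spark 1.4-style) Parquet output for compatibility with older readers
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")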


1 Answer

These are the Spark Parquet configs that are set to false by default:

spark.sql.parquet.mergeSchema
spark.sql.parquet.respectSummaryFiles
spark.sql.parquet.binaryAsString
spark.sql.parquet.int96TimestampConversion
spark.sql.parquet.int64AsTimestampMillis
spark.sql.parquet.writeLegacyFormat
spark.sql.parquet.recordLevelFilter.enabled

These are set to true by default:

spark.sql.parquet.int96AsTimestamp
spark.sql.parquet.filterPushdown
spark.sql.parquet.filterPushdown.date
spark.sql.parquet.filterPushdown.timestamp
spark.sql.parquet.filterPushdown.decimal
spark.sql.parquet.filterPushdown.string.startsWith
spark.sql.parquet.enableVectorizedReader

These properties take a value; they are listed here with their defaults (a short snippet after the list shows how to inspect or override them):

spark.sql.parquet.outputTimestampType = INT96
spark.sql.parquet.compression.codec = snappy
spark.sql.parquet.pushdown.inFilterThreshold = 10
spark.sql.parquet.output.committer.class = org.apache.parquet.hadoop.ParquetOutputCommitter
spark.sql.parquet.columnarReaderBatchSize = 4096
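
As a quick sketch of how to check or override any of the settings above (from a spark-shell session, where spark is the predefined SparkSession; the keys shown are examples):

// Read the current (or default) value
spark.conf.get("spark.sql.parquet.filterPushdown")          // "true"
spark.conf.get("spark.sql.parquet.compression.codec")       // "snappy"

// Override one of the defaults
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")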

Regarding parquet.enable.dictionary, it is not supported by Spark yet, but it can be set through the sqlContext as:

sqlContext.setConf("parquet.enable.dictionary", "false")

The default value of this property is true in Parquet, so it is effectively true when Parquet code is called from Spark.
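
A sketch of the two usual ways to pass this Hadoop-level key through Spark (which is what the spark.hadoop.parquet.enable.dictionary form mentioned in the question refers to); the app name and values here are illustrative:

import org.apache.spark.sql.SparkSession

// At session creation: the spark.hadoop. prefix forwards the key into the Hadoop configuration
val spark = SparkSession.builder()
  .appName("dictionary-demo")
  .config("spark.hadoop.parquet.enable.dictionary", "false")
  .getOrCreate()

// Or on an existing session
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")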

Answered by Ajay Srivastava on Sep 22 '22.