I am looking for documentation on how parquet.enable.dictionary is to be used in Spark (latest 2.3.1). It can be set to "true" or "false" when creating a SparkSession.
I googled for any documentation on this feature and found nothing, or at least nothing recent.
Specifically these are my questions:
Is parquet.filter.dictionary.enabled = true or = false by default in Spark 2.3.1?
Is this a feature to enable (set to true) before I write to Parquet files so that the Parquet library used by Spark computes and writes the dictionary information to disk?
Is this setting ignored when Spark reads the Parquet files, or do I still need to set it to true for reading Parquet (as well as writing)?
When should I use this feature (set to true)? Pros/cons?
I also see references to spark.hadoop.parquet.enable.dictionary when I googled for parquet.enable.dictionary. Is this related? Which should I use?
Are there any other Spark + Parquet settings I need to be aware of?
Many thanks!
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
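For example, a minimal round trip looks like this (a sketch: the path and column names are placeholders, and an existing SparkSession named spark is assumed):
import spark.implicits._
// write a tiny DataFrame to Parquet; Spark stores the schema in the file metadata
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
df.write.mode("overwrite").parquet("/tmp/parquet_roundtrip")
// read it back; the schema is recovered from the Parquet footer, with columns marked nullable
val readBack = spark.read.parquet("/tmp/parquet_roundtrip")
readBack.printSchema()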
Parquet also allows you to compress data pages (snappy, gzip, lzo). The compression codec can be set through a Spark configuration or as a write option. One key thing to remember is that compressed data has to be uncompressed when you read it during your processing. When Spark runs a query against Parquet, it first reads the footer of the Parquet file, where the metadata is stored.
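As a sketch, the codec can be chosen per write or as a session default (df and the path are placeholders):
// per write: pass the codec as a writer option
df.write.option("compression", "gzip").parquet("/tmp/parquet_gzip")
// session default: applies to all Parquet writes that do not override it
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")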
Configuration of Parquet can be done using the setConf method on SparkSession or by running SET key=value commands using SQL. Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema.
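Both styles look roughly like this, using spark.sql.parquet.binaryAsString as the example key:
// programmatic: set the key on the session's runtime config
spark.conf.set("spark.sql.parquet.binaryAsString", "true")
// SQL: the equivalent SET command
spark.sql("SET spark.sql.parquet.binaryAsString=true")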
When spark.sql.parquet.mergeSchema is true, the Parquet data source merges schemas collected from all data files; otherwise the schema is picked from the summary file, or from a random data file if no summary file is available. When spark.sql.parquet.writeLegacyFormat is true, data will be written the way Spark 1.4 and earlier wrote it.
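Schema merging can also be requested for a single read instead of globally (the path here is a placeholder):
// enable schema merging just for this read
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/parquet_partitioned")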
These are the Spark Parquet configuration properties set to false by default -
spark.sql.parquet.mergeSchema
spark.sql.parquet.respectSummaryFiles
spark.sql.parquet.binaryAsString
spark.sql.parquet.int96TimestampConversion
spark.sql.parquet.int64AsTimestampMillis
spark.sql.parquet.writeLegacyFormat
spark.sql.parquet.recordLevelFilter.enabled
These are set to true by default -
spark.sql.parquet.int96AsTimestamp
spark.sql.parquet.filterPushdown
spark.sql.parquet.filterPushdown.date
spark.sql.parquet.filterPushdown.timestamp
spark.sql.parquet.filterPushdown.decimal
spark.sql.parquet.filterPushdown.string.startsWith
spark.sql.parquet.enableVectorizedReader
These properties take a value; they are listed here with their defaults -
spark.sql.parquet.outputTimestampType = INT96
spark.sql.parquet.compression.codec = snappy
spark.sql.parquet.pushdown.inFilterThreshold = 10
spark.sql.parquet.output.committer.class = org.apache.parquet.hadoop.ParquetOutputCommitter
spark.sql.parquet.columnarReaderBatchSize = 4096
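To confirm what your own session resolves these to, you can read the keys back and override the ones you care about; a short sketch:
// SQL configs fall back to their defaults when not set explicitly
println(spark.conf.get("spark.sql.parquet.filterPushdown"))      // "true" by default
println(spark.conf.get("spark.sql.parquet.compression.codec"))   // "snappy" by default
// override for the current session
spark.conf.set("spark.sql.parquet.filterPushdown", "false")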
Regarding parquet.enable.dictionary, it is not supported as a Spark-specific option yet, but it can be set in sqlContext as -
sqlContext.setConf("parquet.enable.dictionary", "false")
The default value of this property is true in Parquet. Therefore, it should be true when the Parquet code is called from Spark.
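The spark.hadoop.parquet.enable.dictionary form mentioned in the question is related: any key prefixed with spark.hadoop. is copied by Spark into the Hadoop Configuration, which is what the Parquet writer reads. A sketch of two equivalent ways to set it (the app name is just a placeholder):
import org.apache.spark.sql.SparkSession
// at session build time: the spark.hadoop. prefix forwards the key to the Hadoop Configuration
val spark = SparkSession.builder()
  .appName("parquet-dictionary-demo")
  .config("spark.hadoop.parquet.enable.dictionary", "false")
  .getOrCreate()
// or, on an existing session, set the Parquet property on the Hadoop Configuration directly
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")
Either route ends up in the same Hadoop Configuration, which is why the spark.hadoop.-prefixed name shows up in search results alongside the plain Parquet property.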