I am looking for documentation on how parquet.enable.dictionary is to be used in Spark (latest 2.3.1). It can be set to "true" or "false" when creating a SparkSession.
I googled for any documentation on this feature and found nothing, or at least nothing recent.
Specifically these are my questions:
Is parquet.filter.dictionary.enabled = true or = false by default in Spark 2.3.1?
Is this a feature to enable (set to true) before I write to Parquet files so that the Parquet library used by Spark computes and writes the dictionary information to disk?
Is this setting ignored when Spark reads the Parquet files, or do I still need to set it to true for reading Parquet (as well as writing)?
When should I use this feature (set to true)? Pros/cons?
I also see references to spark.hadoop.parquet.enable.dictionary when I googled for parquet.enable.dictionary. Is this related? Which should I use?
Are there any other Spark + Parquet settings I need to be aware of?
Many thanks!
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
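For example, a minimal round trip looks like this (a sketch: the path and column names are placeholders, and an existing SparkSession named spark is assumed):
import spark.implicits._
// write a tiny DataFrame to Parquet; Spark stores the schema in the file metadata
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
df.write.mode("overwrite").parquet("/tmp/parquet_roundtrip")
// read it back; the schema is recovered from the Parquet footer, with columns marked nullable
val readBack = spark.read.parquet("/tmp/parquet_roundtrip")
readBack.printSchema()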
Parquet also allows you to compress data pages (snappy, gzip, lzo). The compression codec can be set through a Spark configuration or as a write option. One key thing to remember is that compressed data has to be uncompressed when you read it during your processing. When Spark runs a query against Parquet, it first reads the footer of the Parquet file, where the metadata is stored.
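As a sketch, the codec can be chosen per write or as a session default (df and the path are placeholders):
// per write: pass the codec as a writer option
df.write.option("compression", "gzip").parquet("/tmp/parquet_gzip")
// session default: applies to all Parquet writes that do not override it
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")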
Configuration of Parquet can be done using the setConf method on SparkSession or by running SET key=value commands using SQL. Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema.
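Both styles look roughly like this, using spark.sql.parquet.binaryAsString as the example key:
// programmatic: set the key on the session's runtime config
spark.conf.set("spark.sql.parquet.binaryAsString", "true")
// SQL: the equivalent SET command
spark.sql("SET spark.sql.parquet.binaryAsString=true")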
When spark.sql.parquet.mergeSchema is true, the Parquet data source merges schemas collected from all data files; otherwise the schema is picked from the summary file, or from a random data file if no summary file is available. When spark.sql.parquet.writeLegacyFormat is true, data will be written the way Spark 1.4 and earlier wrote it.
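Schema merging can also be requested for a single read instead of globally (the path here is a placeholder):
// enable schema merging just for this read
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/parquet_partitioned")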
These are the Spark Parquet configuration properties set to false by default -
spark.sql.parquet.mergeSchema
spark.sql.parquet.respectSummaryFiles
spark.sql.parquet.binaryAsString
spark.sql.parquet.int96TimestampConversion
spark.sql.parquet.int64AsTimestampMillis
spark.sql.parquet.writeLegacyFormat
spark.sql.parquet.recordLevelFilter.enabled
These are set to true by default -
spark.sql.parquet.int96AsTimestamp
spark.sql.parquet.filterPushdown
spark.sql.parquet.filterPushdown.date
spark.sql.parquet.filterPushdown.timestamp
spark.sql.parquet.filterPushdown.decimal
spark.sql.parquet.filterPushdown.string.startsWith
spark.sql.parquet.enableVectorizedReader
These properties take a value; they are listed here with their defaults -
spark.sql.parquet.outputTimestampType = INT96
spark.sql.parquet.compression.codec = snappy
spark.sql.parquet.pushdown.inFilterThreshold = 10
spark.sql.parquet.output.committer.class = org.apache.parquet.hadoop.ParquetOutputCommitter
spark.sql.parquet.columnarReaderBatchSize = 4096
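To confirm what your own session resolves these to, you can read the keys back and override the ones you care about; a short sketch:
// SQL configs fall back to their defaults when not set explicitly
println(spark.conf.get("spark.sql.parquet.filterPushdown"))      // "true" by default
println(spark.conf.get("spark.sql.parquet.compression.codec"))   // "snappy" by default
// override for the current session
spark.conf.set("spark.sql.parquet.filterPushdown", "false")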
Regarding parquet.enable.dictionary, it is not supported as a Spark-specific option yet, but it can be set in sqlContext as -
sqlContext.setConf("parquet.enable.dictionary", "false")
The default value of this property is true in Parquet. Therefore, it should be true when the Parquet code is called from Spark.
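The spark.hadoop.parquet.enable.dictionary form mentioned in the question is related: any key prefixed with spark.hadoop. is copied by Spark into the Hadoop Configuration, which is what the Parquet writer reads. A sketch of two equivalent ways to set it (the app name is just a placeholder):
import org.apache.spark.sql.SparkSession
// at session build time: the spark.hadoop. prefix forwards the key to the Hadoop Configuration
val spark = SparkSession.builder()
  .appName("parquet-dictionary-demo")
  .config("spark.hadoop.parquet.enable.dictionary", "false")
  .getOrCreate()
// or, on an existing session, set the Parquet property on the Hadoop Configuration directly
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")
Either route ends up in the same Hadoop Configuration, which is why the spark.hadoop.-prefixed name shows up in search results alongside the plain Parquet property.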