
Disable parquet metadata summary in Spark

I have a Spark job (on 1.4.1) receiving a stream of Kafka events. I would like to save them continuously as Parquet on Tachyon.

val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

lines.window(Seconds(1), Seconds(1)).foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    // Truncate the batch time to the start of the day (86400000 ms = 1 day),
    // so every batch from the same day appends to the same Parquet directory.
    val mil = time.floor(Duration(86400000)).milliseconds
    hiveContext.read.json(rdd).toDF().write.mode(SaveMode.Append).parquet(s"tachyon://192.168.1.12:19998/persisted5$mil")
    hiveContext.sql(s"CREATE TABLE IF NOT EXISTS persisted5$mil USING org.apache.spark.sql.parquet OPTIONS ( path 'tachyon://192.168.1.12:19998/persisted5$mil')")
  }
}

However, I see that as time goes on, on every Parquet write Spark opens each of the one-second Parquet part files, which gets slower and slower:

15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-db03b24d-6f98-4b5d-bb40-530f35b82633.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-3a7857e2-0435-4ee0-ab2c-6d40224f8842.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-47ff2ac1-da00-4473-b3f7-52640014bc5b.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-61625436-7353-4b1e-bb8d-e8afad3a582e.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-e711aa9a-9bf5-41d5-8523-f5edafa69626.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-4e0cca38-cf75-4771-8965-20a30c863100.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-d1510ed4-2c99-43e2-b3d1-38d3d54e626d.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-022d1918-392d-433f-a7f4-074e46b4460f.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-cf71f5d2-ba0e-4729-9aa1-41dad5d1d08f.gz.parquet, 65536)
15/08/22 22:04:05 INFO : open(tachyon://192.168.1.12:19998/persisted51440201600000/part-r-00000-ce990b1e-82cc-4feb-a162-ac3ddc275609.gz.parquet, 65536)

I came to the conclusion that this is due to the update of the summary metadata files, which I believe Spark does not make use of, so I would like to disable it.

The Parquet sources show that I should be able to set "parquet.enable.summary-metadata" to false.

I have tried setting it like this, right after creating the hiveContext:

hiveContext.sparkContext.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)
hiveContext.sparkContext.hadoopConfiguration.setInt("parquet.metadata.read.parallelism", 10) 

but without success. I also still see logs showing a read parallelism of 5 (the default).

What is the correct way to disable Parquet summary metadata in Spark?

asked Aug 22 '15 by Pierre Lacave


2 Answers

setting "parquet.enable.summary-metadata" as text ("false" and not false) seems to work for us.

By the way, Spark does use the _common_metadata file (we copy it over manually for repetitive jobs).
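A minimal sketch of what that looks like, assuming a hiveContext set up as in the question (this just restates the suggestion above, with the property set as a string instead of via setBoolean):

// Set the switch as a string value on the underlying Hadoop configuration,
// right after creating the hiveContext and before any Parquet writes.
hiveContext.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")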

answered Oct 19 '22 by Arnon Rotem-Gal-Oz


Spark 2.0 doesn't save metadata summaries by default any more, see SPARK-15719.

If you are working with data hosted in S3, you may still find Parquet performance hurt by Parquet itself trying to scan the tail of all the objects to check their schemas. That can be disabled explicitly:

sparkConf.set("spark.sql.parquet.mergeSchema", "false")
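
If you prefer to scope this to a single read rather than set it globally, the same flag can also be passed as a Parquet read option; a minimal sketch for a Spark 2.x SparkSession (the bucket path below is just a placeholder):

// Disable schema merging for this one read only; the S3 path is a placeholder.
val df = spark.read
  .option("mergeSchema", "false")
  .parquet("s3a://bucket/path/")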
answered Oct 19 '22 by stevel