
spark parquet write gets slow as partitions grow

I have a Spark Streaming application that writes Parquet data from a stream.

sqlContext.sql(
  """
    |select
    |  to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')) as event_date,
    |  hour(from_utc_timestamp(from_unixtime(at), 'US/Pacific')) as event_hour,
    |  *
    |from events
    |where at >= 1473667200
  """.stripMargin)
  .coalesce(1)
  .write
  .mode(SaveMode.Append)
  .partitionBy("event_date", "event_hour", "verb")
  .parquet(Config.eventsS3Path)

This piece of code runs every hour, but over time the Parquet write has slowed down. When we started, it took 15 minutes to write the data; now it takes 40 minutes. The time taken is proportional to the data already existing in that path. I tried running the same application against a new location, and that runs fast.

I have disabled schemaMerge and summary metadata:

sparkConf.set("spark.sql.hive.convertMetastoreParquet.mergeSchema","false")
sparkConf.set("parquet.enable.summary-metadata","false")

Using Spark 2.0.
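For reference, a minimal sketch of applying those same two options when building a Spark 2.0 SparkSession (this is not the asker's actual setup; the app name is a placeholder, only the option keys come from the question):

import org.apache.spark.sql.SparkSession

// Sketch: the two options from the question set via the SparkSession builder.
val spark = SparkSession.builder()
  .appName("events-stream")
  .config("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
  .config("parquet.enable.summary-metadata", "false")
  .getOrCreate()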

Batch execution times, shown in Spark UI screenshots (omitted here): writing to an empty directory vs. writing to a directory that already contains about 350 partition folders.

Gaurav Shah asked Sep 16 '16 06:09

People also ask

Can a Parquet file be partitioned?

An ORC or Parquet file contains data columns. To these files you can add partition columns at write time. The data files do not store values for partition columns; instead, when writing the files you divide them into groups (partitions) based on column values.
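For illustration, a minimal sketch of that in Spark (the column names, values, and output path below are made up for this example):

import org.apache.spark.sql.SparkSession

// Illustrative only: a tiny DataFrame written with a partition column.
val spark = SparkSession.builder().appName("partition-demo").getOrCreate()
import spark.implicits._

val demo = Seq(
  ("2016-09-16", 10, "click"),
  ("2016-09-16", 11, "view")
).toDF("event_date", "event_hour", "verb")

// The partition column's values end up in the directory names
// (event_date=2016-09-16/...), not inside the Parquet data files themselves.
demo.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/partition-demo")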

Is it better to have one large Parquet file or lots of smaller Parquet files?

Generally, Parquet files (and, for that matter, all files) should be larger than the HDFS block size (128 MB by default). We use the coalesce function with a Hive context and 50 executors for one of our files, which is ~15 GB, and it runs like a charm.
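A hedged sketch of that compaction pattern (the paths are hypothetical; the 50 is the figure quoted above and should be tuned per dataset):

import org.apache.spark.sql.SparkSession

// Hypothetical compaction job: read a large dataset and coalesce it into
// fewer, larger Parquet files before writing it back out.
val spark = SparkSession.builder().appName("compaction-demo").getOrCreate()
val largeDf = spark.read.parquet("/data/raw/events") // ~15 GB in the quoted example

largeDf.coalesce(50) // 50 output files, as in the quote above
  .write.mode("overwrite")
  .parquet("/data/compacted/events")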

What is a good number of partitions in Spark?

The general recommendation for Spark is to have 4x as many partitions as there are cores available to the application in the cluster; as an upper bound, each task should take at least 100 ms to execute.
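As a sketch of applying that rule of thumb (everything below is hypothetical, not taken from the question):

import org.apache.spark.sql.SparkSession

// defaultParallelism usually reflects the total number of executor cores.
val spark = SparkSession.builder().appName("repartition-demo").getOrCreate()
val df = spark.read.parquet("/data/events")

val targetPartitions = spark.sparkContext.defaultParallelism * 4 // 4x-cores rule of thumb
df.repartition(targetPartitions)
  .write.mode("overwrite")
  .parquet("/data/events-repartitioned")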




1 Answer

I've encountered this issue. The append mode is probably the culprit, in that finding the append location takes more and more time as the size of your parquet file grows.

One workaround I've found that solves this is to change the output path regularly. Merging and reordering the data from all the output dataframes is then usually not an issue.

// Hour bucket since an application-defined origin (time.milliseconds and
// timeOrigin come from the streaming job's own context).
def appendix: String = ((time.milliseconds - timeOrigin) / (3600 * 1000)).toString

// Each hour's batch goes to its own path, so append never rescans old data.
df.write.mode(SaveMode.Append).format("parquet").save(s"${outputPath}-H$appendix")
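If you later need the data as one logical dataset, the hourly outputs can be read back together with a path glob; a sketch, assuming the same spark session and outputPath as in the snippet above:

// Read every hourly output into a single DataFrame; the glob matches
// the -H<n> suffix produced by the appendix helper above.
val allEvents = spark.read.parquet(s"${outputPath}-H*")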
Francois G answered Sep 19 '22 15:09