I have a data frame that, saved in Parquet format, takes ~11GB. Reading it into a DataFrame and writing it out as JSON takes 5 minutes. When I add partitionBy("day"), it takes hours to finish. I understand that distributing the data to partitions is the costly step. Is there a way to make it faster? Would sorting the files make it better?
Example:
Runs in 5 minutes:
df = spark.read.parquet(source_path)
df.write.json(output_path)
Runs for hours:
spark.read.parquet(source_path).createOrReplaceTempView("source_table")
sql="""
select cast(date_format(date, 'yyyyMMdd') as int) as day, a.*
from source_table a"""
spark.sql(sql).write.partitionBy("day").json(output_path)
Try adding a repartition("day")
before the write
, like this:
(spark.sql(sql)
    .repartition("day")
    .write
    .partitionBy("day")
    .json(output_path))
It should speed up your query: the repartition shuffles all records for a given day into the same partition, so each day is written by a single task instead of every task writing a small file for every day.
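If you also want to bound how many JSON files end up in each day folder, you can pass an explicit partition count together with the column. A minimal sketch, reusing the sql and output_path variables from above (the 200 is just an illustrative number, not from your post; tune it to your data volume):
# Shuffle into at most 200 partitions keyed by "day", so each day folder
# gets a bounded number of output files rather than one per upstream task.
(spark.sql(sql)
    .repartition(200, "day")
    .write
    .partitionBy("day")
    .json(output_path))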