I have a data frame that, saved in Parquet format, takes ~11GB. Reading it into a DataFrame and writing it out as JSON takes 5 minutes. When I add partitionBy("day"), it takes hours to finish. I understand that distributing the data to partitions is the costly step. Is there a way to make it faster? Would sorting the files make it better?
Example:
Runs in 5 minutes:
df = spark.read.parquet(source_path)
df.write.json(output_path)
Runs for hours:
spark.read.parquet(source_path).createOrReplaceTempView("source_table")
sql="""
select cast(date_format(date, 'yyyyMMdd') as int) as day, a.*
from source_table a"""
spark.sql(sql).write.partitionBy("day").json(output_path)
Try adding a repartition("day")
before the write
, like this:
(spark.sql(sql)
    .repartition("day")
    .write
    .partitionBy("day")
    .json(output_path))
It should speed up your query: the repartition shuffles all records for a given day into the same partition, so each day is written by a single task instead of every task writing a small file for every day.
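If you also want to bound how many JSON files end up in each day folder, you can pass an explicit partition count together with the column. A minimal sketch, reusing the sql and output_path variables from above (the 200 is just an illustrative number, not from your post; tune it to your data volume):
# Shuffle into at most 200 partitions keyed by "day", so each day folder
# gets a bounded number of output files rather than one per upstream task.
(spark.sql(sql)
    .repartition(200, "day")
    .write
    .partitionBy("day")
    .json(output_path))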