
DataFrame partitionBy to a single Parquet file (per partition)

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this:

df.coalesce(1)
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")

I've tested this and it doesn't seem to perform well. This is because coalesce(1) leaves only one partition in the dataset, so all the partitioning, compression and saving of files has to be done by a single CPU core.

I could rewrite this to do the partitioning manually (using filter with the distinct partition values for example) before calling coalesce.
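For reference, here is a rough sketch of that manual approach, assuming the same df and location as above and filtering on the distinct values of "entity" only (iterating over every combination of partition columns works the same way, just with many more jobs). Note that each write still runs on a single core for that entity, and the jobs run one after another unless they are submitted concurrently:

import org.apache.spark.sql.SaveMode
import spark.implicits._

// Collect the distinct partition values to filter on (illustrative: "entity" only).
val entities = df.select("entity").distinct().as[String].collect()

entities.foreach { e =>
  df.filter($"entity" === e)
    .coalesce(1)  // one task per entity instead of one task for the whole dataset
    .write
    .partitionBy("entity", "year", "month", "day", "status")
    .mode(SaveMode.Append)
    .parquet(s"$location")
}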

But is there a better way to do this using the standard Spark SQL API?

asked Jan 14 '16 by Patrick McGloin


People also ask

Can a Parquet file be partitioned?

An ORC or Parquet file contains data columns. To these files you can add partition columns at write time. The data files do not store values for partition columns; instead, when writing the files you divide them into groups (partitions) based on column values.
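As a small illustration (the column names and output path here are made up), the partition column values end up in the directory names rather than inside the Parquet files themselves:

df.write
  .partitionBy("year", "month")
  .parquet("/tmp/events")

// Resulting layout -- the "year" and "month" values live in the paths:
//   /tmp/events/year=2016/month=01/part-....parquet
//   /tmp/events/year=2016/month=02/part-....parquet
// The part files contain only the remaining columns; Spark adds "year" and
// "month" back from the directory names when the location is read again.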

How many partitions do we get when we create a Spark DataFrame by reading a Parquet file stored in HDFS?

When reading files, the partition count is driven by the file and split sizes; by default, DataFrame shuffle operations create 200 partitions.
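A minimal sketch to check this yourself, assuming a SparkSession named spark and a made-up HDFS path:

val df = spark.read.parquet("hdfs:///data/events.parquet")

// Input partitions are driven by file and split sizes
// (spark.sql.files.maxPartitionBytes, 128 MB by default), not by the shuffle setting.
println(df.rdd.getNumPartitions)

// Default partition count for shuffle operations (groupBy, join, ...): 200.
println(spark.conf.get("spark.sql.shuffle.partitions"))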

What is the advantage of storing DataFrames in Parquet files in Spark SQL?

Parquet offers faster query execution than other standard file formats such as Avro and JSON, and it also consumes less disk space than Avro and JSON.


1 Answer

I had the exact same problem and I found a way to do this using DataFrame.repartition(). The problem with using coalesce(1) is that your parallelism drops to 1, and it can be slow at best and error out at worst. Increasing that number doesn't help either -- if you do coalesce(10) you get more parallelism, but end up with 10 files per partition.

To get one file per partition without using coalesce(), use repartition() with the same columns you want the output to be partitioned by. So in your case, do this:

import spark.implicits._

df.repartition($"entity", $"year", $"month", $"day", $"status")
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")

Once I do that, I get one Parquet file per output partition instead of multiple files.

I tested this in Python, but I assume it works the same way in Scala.

answered Oct 06 '22 by mortada