I have a DataFrame that I need to write to S3 according to a specific partitioning. The code looks like this:
dataframe
.write
.mode(SaveMode.Append)
.partitionBy("year", "month", "date", "country", "predicate")
.parquet(outputPath)
The partitionBy splits the data into a fairly large number of folders (~400), each holding only a small amount of data (~1 GB). And here comes the problem: because the default value of spark.sql.shuffle.partitions is 200, the 1 GB of data in each folder is split into 200 small parquet files, resulting in roughly 80,000 parquet files being written in total. This is not optimal for a number of reasons, and I would like to avoid it.
I could of course set spark.sql.shuffle.partitions to a much smaller number, say 10, but as I understand it, this setting also controls the number of partitions for shuffles in joins and aggregations, so I don't really want to change it.
Does anyone know if there is another way to control how many files are written?
As you noted correctly, spark.sql.shuffle.partitions only applies to shuffles and joins in SparkSQL.
partitionBy in DataFrameWriter (you move from DataFrame to DataFrameWriter as soon as you call write) simply operates on the previous number of partitions. (The writer's partitionBy only assigns columns to the table / parquet files that will be written out, so it has nothing to do with the number of in-memory partitions. This is a bit confusing.)
Long story short, just repartition the DataFrame before you transform it to a writer.
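For example, a minimal sketch reusing the names from your snippet (dataframe, outputPath, and the same partition columns): repartition on the partitioning columns before calling write, so each output folder is produced from a single in-memory partition:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

dataframe
  // group all rows with the same partition-column values into the same in-memory partition
  .repartition(col("year"), col("month"), col("date"), col("country"), col("predicate"))
  .write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "date", "country", "predicate")
  .parquet(outputPath)

Because rows sharing the same combination of partition values land in the same in-memory partition, each output folder then contains roughly one parquet file instead of 200, without touching spark.sql.shuffle.partitions.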