What is the difference between the DataFrame <code>repartition()</code> and DataFrameWriter <code>partitionBy()</code> methods? I assume both are used to "partition data based on a DataFrame column". Or is there a difference?

Difference between df.repartition and DataFrameWriter partitionBy?

2 Answers

Watch out: I believe the accepted answer is not quite right! I'm glad you asked this question, because the behavior of these similarly named functions differs in important and unexpected ways that are not well documented in the official Spark documentation.

The first part of the accepted answer is correct: calling df.repartition(COL, numPartitions=k) will create a dataframe with k partitions using a hash-based partitioner. COL here defines the partitioning key; it can be a single column or a list of columns. The hash-based partitioner takes each input row's partition key and hashes it into a space of k partitions via something like partition = hash(partitionKey) % k. This guarantees that all rows with the same partition key end up in the same partition. However, rows from multiple partition keys can also end up in the same partition (when their hashes map to the same partition number), and some partitions might be empty.

In summary, the unintuitive aspects of df.repartition(COL, numPartitions=k) are that

- partitions will not strictly segregate partition keys
- some of your k partitions may be empty, whereas others may contain rows from multiple partition keys
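
To make the hash-partitioning behavior concrete, here is a minimal pure-Python simulation. It is only a sketch: it does not call Spark, the helper names (assign_partition, hash_partition) are made up for illustration, and CRC32 is a stand-in for whatever hash function Spark actually uses. It still shows the same properties: one key never spans two partitions, but one partition can hold several keys or none.

```python
import zlib


def assign_partition(key: str, k: int) -> int:
    # Stand-in hash (assumption): Spark uses its own hash function, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % k


def hash_partition(rows, key_field, k):
    """Group rows into k buckets using hash(key) % k."""
    partitions = [[] for _ in range(k)]
    for row in rows:
        partitions[assign_partition(row[key_field], k)].append(row)
    return partitions


# Seven distinct partition keys (days), three rows each.
days = [f"2022-09-{d:02d}" for d in range(1, 8)]
rows = [{"day": day, "value": i} for day in days for i in range(3)]

# With k=4, seven keys must share four partitions.
for index, part in enumerate(hash_partition(rows, "day", k=4)):
    print(index, sorted({row["day"] for row in part}))

# With k=16, at most seven partitions can be non-empty.
empty = sum(1 for part in hash_partition(rows, "day", k=16) if not part)
print("empty partitions:", empty)
```

With 7 keys and k=4 at least one partition necessarily holds more than one key, and with k=16 at least 9 partitions are necessarily empty, whichever hash function is used.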

The behavior of df.write.partitionBy is quite different, in a way that many users won't expect. Let's say that you want your output files to be date-partitioned, and your data spans 7 days. Let's also assume that df has 10 partitions to begin with. When you run df.write.partitionBy('day'), how many output files should you expect? The answer is 'it depends'. If each of the starting partitions in df contains data from every day, then the answer is 70. If each of the starting partitions in df contains data from exactly one day, then the answer is 10.

How can we explain this behavior? When you run df.write, each of the original partitions in df is written independently. That is, each of your original 10 partitions is sub-partitioned separately on the 'day' column, and a separate file is written for each sub-partition.
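
The file-count arithmetic can be checked with a small pure-Python model. This is a sketch of the counting rule described above (one file per input partition and distinct 'day' value), not Spark code; the function name count_output_files and the sample data are hypothetical.

```python
def count_output_files(partitions, key_field):
    """One file per (input partition, distinct key value) pair."""
    return sum(len({row[key_field] for row in part}) for part in partitions)


days = [f"day-{d}" for d in range(1, 8)]

# Case 1: each of the 10 partitions holds rows from all 7 days.
mixed = [[{"day": day} for day in days] for _ in range(10)]

# Case 2: each of the 10 partitions holds rows from exactly one day.
one_day_each = [[{"day": days[i % 7]}] * 5 for i in range(10)]

print(count_output_files(mixed, "day"))         # 70
print(count_output_files(one_day_each, "day"))  # 10
```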

I find this behavior rather annoying and wish there were a way to do a global repartitioning when writing dataframes.
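
One commonly used way to get a global repartitioning, which is an addition here and not something the answer itself proposes, is to call repartition on the same column before writing, so that all rows for a given day sit in a single partition when the writer runs. The simulation below is a hedged sketch of that idea in pure Python: the helper names are hypothetical, CRC32 stands in for Spark's hash, and k=200 mirrors the default value of spark.sql.shuffle.partitions.

```python
import zlib


def repartition_by_key(partitions, key_field, k):
    """Simulated shuffle: route every row to bucket hash(key) % k."""
    shuffled = [[] for _ in range(k)]
    for part in partitions:
        for row in part:
            # CRC32 is a stand-in hash (assumption), not Spark's own.
            bucket = zlib.crc32(row[key_field].encode("utf-8")) % k
            shuffled[bucket].append(row)
    return shuffled


def count_output_files(partitions, key_field):
    """One file per (partition, distinct key value) pair."""
    return sum(len({row[key_field] for row in part}) for part in partitions)


days = [f"day-{d}" for d in range(1, 8)]
# 10 starting partitions, each containing rows from all 7 days.
mixed = [[{"day": day} for day in days] for _ in range(10)]

before = count_output_files(mixed, "day")
after = count_output_files(repartition_by_key(mixed, "day", k=200), "day")
print(before, after)  # 70 7
```

In PySpark this corresponds roughly to df.repartition('day').write.partitionBy('day') followed by the usual save call. Real jobs may still produce a different number of files depending on configuration, so treat the count of 7 as a property of this model, not a guarantee.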

132

answered Sep 24 '22 18:09

conradlee

If you run repartition(COL), you change the partitioning used during computation: you will get spark.sql.shuffle.partitions (default: 200) partitions. If you then call .write, you will get one directory with many files.

If you run .write.partitionBy(COL), then as a result you will get as many directories as there are unique values in COL. This speeds up further data reading (if you filter by the partitioning column) and saves some storage space (the partitioning column is removed from the data files).

UPDATE: See @conradlee's answer. He explains in detail not only what the directory structure will look like after applying the different methods, but also what the resulting number of files will be in both scenarios.

answered Sep 24 '22 18:09

Mariusz



Tags:

apache-spark-sql

data-partitioning

Shankar
