I am trying to repartition a DataFrame according to a column. The DataFrame has N (let's say N = 3) different values in the partition column x, e.g.:
val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x") // create dummy data
What I would like to achieve is to repartition myDF by x without producing empty partitions. Is there a better way than doing this?
val numParts = myDF.select($"x").distinct().count.toInt
myDF.repartition(numParts, $"x")
(If I don't specify numParts in repartition, most of my partitions are empty, as repartition creates 200 partitions by default...)
There isn't an easy way to simply delete the empty partitions from an RDD. coalesce doesn't guarantee that the empty partitions will be deleted. If you have an RDD with 40 blank partitions and 10 partitions with data, there will still be empty partitions after rdd.coalesce(45).
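To illustrate, here is a minimal sketch of my own (not from the original answer), assuming a live SparkContext sc. The RDD below has 50 partitions, but its keys hash into only 10 of them, and coalesce(45) can still leave many of them empty:

val rdd = sc.parallelize(1 to 10, 10)
  .map(x => (x, x))
  .partitionBy(new org.apache.spark.HashPartitioner(50)) // ~40 of the 50 partitions end up empty
val sizes = rdd.coalesce(45)
  .mapPartitions(it => Iterator(it.size)) // record count per partition
  .collect()
println(sizes.count(_ == 0)) // still well above zero: coalesce(45) did not remove the empties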
Spark RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition() in which the movement of data across partitions is lower.
For example, when shrinking from 5 partitions to 2, coalesce will not move the data already sitting on 2 of the executors and will only move the data from the remaining 3 executors to those 2, thereby avoiding a full shuffle. For the same reason, the resulting partition sizes can vary by a high degree. Since a full shuffle is avoided, coalesce is more performant than repartition.
If you want to increase the number of partitions of your DataFrame, you need the repartition() function instead. It returns a new DataFrame partitioned by the given partitioning expressions; the resulting DataFrame is hash partitioned.
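To make the distinction concrete, here is a short sketch of my own, assuming a SparkSession named spark:

val df = spark.range(100).toDF("id")  // initial partition count depends on your cluster

val more  = df.repartition(16)  // full shuffle; result is hash partitioned into 16 partitions
val fewer = df.coalesce(4)      // merges existing partitions; avoids a full shuffle

println(more.rdd.partitions.length)   // 16
println(fewer.rdd.partitions.length)  // 4 (assuming df started with at least 4 partitions)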
I'd think of a solution that iterates over the partitions of df and fetches the record count in each to find the non-empty partitions.
val nonEmptyPart = sparkContext.longAccumulator("nonEmptyPart")
df.foreachPartition(partition =>
  if (partition.length > 0) nonEmptyPart.add(1))
Now that we have the number of non-empty partitions (nonEmptyPart), we can drop the empty ones by using coalesce() (see coalesce() vs repartition() above).
val finalDf = df.coalesce(nonEmptyPart.value.toInt) //coalesce() accepts only Int type
It may or may not be the best approach, but this solution avoids shuffling because we are not using repartition():
val df1 = sc.parallelize(Seq(1, 1, 2, 2, 3, 3)).toDF("x").repartition($"x")
val nonEmptyPart = sc.longAccumulator("nonEmptyPart")
df1.foreachPartition(partition =>
  if (partition.length > 0) nonEmptyPart.add(1))
val finalDf = df1.coalesce(nonEmptyPart.value.toInt)
println(s"nonEmptyPart => ${nonEmptyPart.value.toInt}")
println(s"df.rdd.partitions.length => ${df1.rdd.partitions.length}")
println(s"finalDf.rdd.partitions.length => ${finalDf.rdd.partitions.length}")
Output
nonEmptyPart => 3
df.rdd.partitions.length => 200
finalDf.rdd.partitions.length => 3
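As a variation (my own sketch, not part of the original answer), the same non-empty-partition count can be computed without an accumulator by mapping over the partitions of the underlying RDD:

val nonEmpty = df1.rdd
  .mapPartitions(it => Iterator(if (it.hasNext) 1 else 0)) // 1 per non-empty partition
  .sum() // total number of non-empty partitions
  .toInt
val finalDf2 = df1.coalesce(nonEmpty)

Like the accumulator version, this triggers an extra job over the data before the coalesce.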