 

How can I keep the partition number unchanged when I use the Window.partitionBy() function with Spark/Scala?

I have an RDD, and the partition count of the result changes to 200 when I use a window function. Can I keep the partition count unchanged when using a window?

This is my code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}
val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val result = rdd.toDF("values").withColumn("csum", sum(col("values")).over(Window.partitionBy(col("values")))).rdd
println(result.getNumPartitions + " partitions")

My input has 4 partitions; why does the result have 200?

I want the result to also have 4 partitions.

Is there any cleaner solution?

asked Feb 04 '23 by mentongwu

1 Answer

Note: as mentioned by @eliasah, it's not possible to avoid the repartition (shuffle) when using window functions with Spark.


  • Why is the result partition count 200?

Per the Spark docs, the default value of spark.sql.shuffle.partitions, which configures the number of partitions to use when shuffling data for joins or aggregations, is 200.
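If you want to lower the shuffle parallelism for the whole session rather than fixing it per query, that setting can be changed on the SparkSession. A minimal sketch, assuming a session named spark as provided by spark-shell:

```scala
// Illustrative: every subsequent shuffle (window, join, aggregation)
// in this session will now produce 4 partitions instead of 200.
spark.conf.set("spark.sql.shuffle.partitions", "4")
```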

  • How can I get the result back to 4 partitions?

You can use:

coalesce(4)

or

repartition(4)

From the Spark docs:

coalesce(numPartitions) Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions) Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
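Putting this together with the question's code, here is a minimal self-contained sketch (the local-mode SparkSession builder is illustrative; in spark-shell you would already have spark and sc):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("window-partitions")
  .getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val result = rdd.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.partitionBy(col("values"))))
  .coalesce(4) // collapse the 200 shuffle partitions back down to 4
  .rdd
println(result.getNumPartitions) // 4
```

coalesce(4) is the cheaper choice here since it only merges the shuffle's output partitions; repartition(4) would trigger a second full shuffle.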

answered Feb 07 '23 by Yaron