I have an RDD. When I apply a window function, the result's partition count changes to 200. Can I keep the partition count unchanged when using a window function?
This is my code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._

val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val result = rdd.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.partitionBy(col("values"))))
  .rdd
println(result.getNumPartitions)
My input has 4 partitions, so why does the result have 200? I want the result to also have 4 partitions. Is there a clean solution?
Note: as mentioned by @eliasah, it's not possible to avoid the repartition when using window functions with Spark.
- Why is the result's partition count 200?

From the Spark docs: the default value of spark.sql.shuffle.partitions, which "configures the number of partitions to use when shuffling data for joins or aggregations", is 200. A window function triggers a shuffle (to group rows by the partitioning key), so the output picks up that default.
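As a sketch (assuming an existing SparkSession named `spark` and SparkContext `sc`, as in a spark-shell session), you can lower the shuffle partition count before applying the window, so the shuffled result comes out with 4 partitions instead of 200. Note this setting affects all subsequent shuffles in the session, not just this one:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Lower the shuffle partition count (default is 200) before the window runs.
spark.conf.set("spark.sql.shuffle.partitions", "4")

import spark.implicits._
val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val result = rdd.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.partitionBy(col("values"))))
  .rdd

// The shuffle now produces 4 partitions rather than 200.
println(result.getNumPartitions)
```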
- How can I repartition to 4?
You can use:
coalesce(4)
or
repartition(4)
From the Spark docs:
coalesce(numPartitions) Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
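Applied to the `result` RDD from the question, the two options above can be sketched like this (assuming the same spark-shell session; `coalesce` merges existing partitions without a full shuffle, while `repartition` reshuffles everything for an even balance):

```scala
// coalesce(4): narrow dependency, merges partitions without a full shuffle.
val coalesced = result.coalesce(4)
println(coalesced.getNumPartitions) // 4

// repartition(4): full shuffle, rebalances data evenly across partitions.
val repartitioned = result.repartition(4)
println(repartitioned.getNumPartitions) // 4
```

Since you are only shrinking 200 partitions down to 4, `coalesce(4)` is usually the cheaper choice here.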