I have an RDD. When I apply a window function, the result's partition count changes to 200. Can I keep the partition count unchanged when using a window function?
This is my code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._

val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val result = rdd.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.partitionBy(col("values"))))
  .rdd
println(result.getNumPartitions)
My input has 4 partitions, so why does the result have 200? I want the result to also have 4 partitions. Is there a clean solution?
Note: as mentioned by @eliasah, it's not possible to avoid the repartition when using window functions with Spark.
- Why is the result's partition count 200?

From the Spark docs: the default value of spark.sql.shuffle.partitions, which "configures the number of partitions to use when shuffling data for joins or aggregations", is 200. A window function triggers a shuffle (to group rows by the partitioning key), so the output picks up that default.
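As a sketch (assuming an existing SparkSession named `spark` and SparkContext `sc`, as in a spark-shell session), you can lower the shuffle partition count before applying the window, so the shuffled result comes out with 4 partitions instead of 200. Note this setting affects all subsequent shuffles in the session, not just this one:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Lower the shuffle partition count (default is 200) before the window runs.
spark.conf.set("spark.sql.shuffle.partitions", "4")

import spark.implicits._
val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val result = rdd.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.partitionBy(col("values"))))
  .rdd

// The shuffle now produces 4 partitions rather than 200.
println(result.getNumPartitions)
```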
- How can I repartition to 4?
You can use:
coalesce(4)
or
repartition(4)
From the Spark docs:
coalesce(numPartitions) Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
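Applied to the `result` RDD from the question, the two options above can be sketched like this (assuming the same spark-shell session; `coalesce` merges existing partitions without a full shuffle, while `repartition` reshuffles everything for an even balance):

```scala
// coalesce(4): narrow dependency, merges partitions without a full shuffle.
val coalesced = result.coalesce(4)
println(coalesced.getNumPartitions) // 4

// repartition(4): full shuffle, rebalances data evenly across partitions.
val repartitioned = result.repartition(4)
println(repartitioned.getNumPartitions) // 4
```

Since you are only shrinking 200 partitions down to 4, `coalesce(4)` is usually the cheaper choice here.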