Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Partition Spark DataFrame based on column

Tags:

apache-spark

I'm trying to partition a Spark DataFrame based on the column "b" using groupByKey() but I end up having different groups in the same partition.

Here is the Data Frame and the code I'm using:

df:
+---+---+
|  a|  b|
+---+---+
|  4|  2|
|  5|  1|
|  1|  4|
|  2|  2|
+---+---+


 val partitions = df.map(x => x.getLong(1)).distinct().count().toInt
 val df2 = df.map(r => (r.getLong(1), r)).groupByKey(partitions)
 val gb = df2.mapPartitions(iterator => {
            val rows = iterator.toList
            println(rows)
            iterator
            })

The printed rows are:
Partition 1: List((2,CompactBuffer([4,2], [2,2])))
Partition 2: List((4,CompactBuffer([1,4])), (1,CompactBuffer([5,1])))

Groups 4 and 1 are in the same partition (2) and I would like to have them in separate partitions, do you know how to do that?

Desired output:
Partition 1: List((2,CompactBuffer([4,2], [2,2])))
Partition 2: List((4,CompactBuffer([1,4])))
Partition 3: List((1,CompactBuffer([5,1])))

P.S. To give you a bit of context, I'm doing this because I need to update rows in a DataFrame using data from all the other rows sharing the same value for a specific column. Therefore map() is not enough, I'm currently trying to use mapPartitions() where each partition would contain all the rows having a the same value for the specific column. Don't hesitate to tell me if you know a better way of doing this :)

Thanks a lot!

ClydeX

like image 202
ClydeX Avatar asked Jun 14 '26 01:06

ClydeX


1 Answers

It sounds like what you are trying to do, could be accomplished by using Window Functions: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

like image 139
Andreas Ryge Avatar answered Jun 16 '26 19:06

Andreas Ryge



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!