Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to aggregate on one column and take maximum of others in pyspark?

I have columns X (string), Y (string), and Z (float).

And I want to

  • aggregate on X
  • take the maximum of column Z
  • report ALL the values for columns X, Y, and Z

If there are multiple values for column Y that correspond to the maximum for column Z, then take the maximum of those values in column Y.

For example, my table is like: table1:

col X col Y col Z
A     1     5
A     2     10
A     3     10
B     5     15

resulting in:

A     3     10
B     5     15

If I were using SQL, I would do it like this:

select X, Y, Z 
from table1 
join (select max(Z) as max_Z from table1 group by X) table2
on table1.Z = table2.max_Z

However how do I do this when 1) column Z is a float? and 2) I'm using pyspark sql?

like image 646
makansij Avatar asked Dec 16 '25 12:12

makansij


1 Answers

The two following solutions are in Scala, but honestly could not resist posting them to promote my beloved window aggregate functions. Sorry.

The only question is which structured query is more performant/effective?

Window Aggregate Function: rank

val df = Seq(
  ("A",1,5),
  ("A",2,10),
  ("A",3,10),
  ("B",5,15)
).toDF("x", "y", "z")

scala> df.show
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  A|  1|  5|
|  A|  2| 10|
|  A|  3| 10|
|  B|  5| 15|
+---+---+---+

// describe window specification
import org.apache.spark.sql.expressions.Window
val byX = Window.partitionBy("x").orderBy($"z".desc).orderBy($"y".desc)

// use rank to calculate the best X
scala> df.withColumn("rank", rank over byX)
  .select("x", "y", "z")
  .where($"rank" === 1) // <-- take the first row
  .orderBy("x")
  .show
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  A|  3| 10|
|  B|  5| 15|
+---+---+---+

Window Aggregate Function: first and dropDuplicates

I've always been thinking about the alternatives to rank function and first usually sprung to mind.

// use first and dropDuplicates
scala> df.
  withColumn("y", first("y") over byX).
  withColumn("z", first("z") over byX).
  dropDuplicates.
  orderBy("x").
  show
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  A|  3| 10|
|  B|  5| 15|
+---+---+---+
like image 72
Jacek Laskowski Avatar answered Dec 19 '25 05:12

Jacek Laskowski



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!