I have columns X (string), Y (string), and Z (float).
I want to group by column X, taking the row with the maximum value in column Z. If there are multiple values of column Y that correspond to that maximum of column Z, then take the maximum of those values in column Y.
For example, my table (table1) looks like this:
col X  col Y  col Z
A      1      5
A      2      10
A      3      10
B      5      15
resulting in:
col X  col Y  col Z
A      3      10
B      5      15
If I were using SQL, I would do it like this:
select table1.X, table1.Y, table1.Z
from table1
join (select X, max(Z) as max_Z from table1 group by X) table2
on table1.X = table2.X and table1.Z = table2.max_Z
However, how do I do this when (1) column Z is a float, and (2) I'm using PySpark SQL?
The following two solutions are in Scala, but I honestly could not resist posting them to promote my beloved window aggregate functions. Sorry.
The only question is which structured query is more performant/effective?
val df = Seq(
  ("A", 1, 5),
  ("A", 2, 10),
  ("A", 3, 10),
  ("B", 5, 15)
).toDF("x", "y", "z")
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
// describe the window specification: per x, order by z descending,
// breaking ties with the larger y (note that chaining two orderBy calls
// would keep only the last ordering, so both columns go in a single call)
import org.apache.spark.sql.expressions.Window
val byX = Window.partitionBy("x").orderBy($"z".desc, $"y".desc)
// use rank to pick the best row per x
scala> df.withColumn("rank", rank over byX)
         .where($"rank" === 1) // <-- keep the top-ranked row(s), while rank still exists
         .select("x", "y", "z")
         .orderBy("x")
         .show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
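Since the question asks for PySpark, here is a minimal PySpark sketch of the same rank-based approach (the data and the column names x, y, z mirror the Scala example above; z is created as a float to match the question):

# minimal PySpark sketch of the rank-based approach above
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# z is a float, as in the question; rank only sorts z,
# so there is no float-equality comparison anywhere
df = spark.createDataFrame(
    [("A", 1, 5.0), ("A", 2, 10.0), ("A", 3, 10.0), ("B", 5, 15.0)],
    ["x", "y", "z"],
)

# per x: best z first, ties broken by the larger y
by_x = Window.partitionBy("x").orderBy(F.desc("z"), F.desc("y"))

(df.withColumn("rank", F.rank().over(by_x))
   .where(F.col("rank") == 1)
   .select("x", "y", "z")
   .orderBy("x")
   .show())

This also sidesteps the float concern from the question entirely: unlike the SQL attempt, nothing is joined on Z, it is only used as a sort key.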
I have often thought about alternatives to the rank function, and first usually springs to mind.
// use first and dropDuplicates
scala> df.
  withColumn("y", first("y") over byX). // <-- overwrite y with the top row's y per x
  withColumn("z", first("z") over byX). // <-- same for z
  dropDuplicates. // <-- collapse the now-identical rows
  orderBy("x").
  show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
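And a PySpark sketch of the first/dropDuplicates variant, reusing df and by_x from the sketch above:

# PySpark sketch of the first/dropDuplicates variant,
# reusing df and by_x from the previous sketch
(df.withColumn("y", F.first("y").over(by_x))
   .withColumn("z", F.first("z").over(by_x))
   .dropDuplicates()
   .orderBy("x")
   .show())

Both sketches produce the same two-row output as the Scala versions.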