I have columns X (string), Y (string), and Z (float).
I want to group by column X, taking the row with the maximum value in column Z. If there are multiple values of column Y that correspond to that maximum of column Z, then take the maximum of those values in column Y.
For example, my table (table1) looks like this:
col X  col Y  col Z
A      1      5
A      2      10
A      3      10
B      5      15
resulting in:
col X  col Y  col Z
A      3      10
B      5      15
If I were using SQL, I would do it like this:
select table1.X, table1.Y, table1.Z
from table1
join (select X, max(Z) as max_Z from table1 group by X) table2
on table1.X = table2.X and table1.Z = table2.max_Z
However, how do I do this when (1) column Z is a float, and (2) I'm using PySpark SQL?
The following two solutions are in Scala, but I honestly could not resist posting them to promote my beloved window aggregate functions. Sorry.
The only question is which structured query is more performant/effective?
val df = Seq(
  ("A", 1, 5),
  ("A", 2, 10),
  ("A", 3, 10),
  ("B", 5, 15)
).toDF("x", "y", "z")
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
// describe the window specification: per x, order by z descending,
// breaking ties with the larger y (note that chaining two orderBy calls
// would keep only the last ordering, so both columns go in a single call)
import org.apache.spark.sql.expressions.Window
val byX = Window.partitionBy("x").orderBy($"z".desc, $"y".desc)
// use rank to pick the best row per x
scala> df.withColumn("rank", rank over byX)
         .where($"rank" === 1) // <-- keep the top-ranked row(s), while rank still exists
         .select("x", "y", "z")
         .orderBy("x")
         .show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
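Since the question asks for PySpark, here is a minimal PySpark sketch of the same rank-based approach (the data and the column names x, y, z mirror the Scala example above; z is created as a float to match the question):

# minimal PySpark sketch of the rank-based approach above
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# z is a float, as in the question; rank only sorts z,
# so there is no float-equality comparison anywhere
df = spark.createDataFrame(
    [("A", 1, 5.0), ("A", 2, 10.0), ("A", 3, 10.0), ("B", 5, 15.0)],
    ["x", "y", "z"],
)

# per x: best z first, ties broken by the larger y
by_x = Window.partitionBy("x").orderBy(F.desc("z"), F.desc("y"))

(df.withColumn("rank", F.rank().over(by_x))
   .where(F.col("rank") == 1)
   .select("x", "y", "z")
   .orderBy("x")
   .show())

This also sidesteps the float concern from the question entirely: unlike the SQL attempt, nothing is joined on Z, it is only used as a sort key.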
I have often thought about alternatives to the rank function, and first usually springs to mind.
// use first and dropDuplicates
scala> df.
  withColumn("y", first("y") over byX). // <-- overwrite y with the top row's y per x
  withColumn("z", first("z") over byX). // <-- same for z
  dropDuplicates. // <-- collapse the now-identical rows
  orderBy("x").
  show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
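And a PySpark sketch of the first/dropDuplicates variant, reusing df and by_x from the sketch above:

# PySpark sketch of the first/dropDuplicates variant,
# reusing df and by_x from the previous sketch
(df.withColumn("y", F.first("y").over(by_x))
   .withColumn("z", F.first("z").over(by_x))
   .dropDuplicates()
   .orderBy("x")
   .show())

Both sketches produce the same two-row output as the Scala versions.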