Given the following DataFrame:
+----+-----+---+-----+
| uid| k| v|count|
+----+-----+---+-----+
| a|pref1| b| 168|
| a|pref3| h| 168|
| a|pref3| t| 63|
| a|pref3| k| 84|
| a|pref1| e| 84|
| a|pref2| z| 105|
+----+-----+---+-----+
How can I get the max of count for each (uid, k) group, but also keep the corresponding v? The expected output is:
+----+-----+---+----------+
| uid| k| v|max(count)|
+----+-----+---+----------+
| a|pref1| b| 168|
| a|pref3| h| 168|
| a|pref2| z| 105|
+----+-----+---+----------+
I can do something like this, but it drops the column "v":
df.groupBy("uid", "k").max("count")
This is a perfect use case for window operators (using the over function) or a join. Since you've already figured out how to use windows, I'll focus exclusively on the join approach.
scala> val inventory = Seq(
| ("a", "pref1", "b", 168),
| ("a", "pref3", "h", 168),
| ("a", "pref3", "t", 63)).toDF("uid", "k", "v", "count")
inventory: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 2 more fields]
scala> val maxCount = inventory.groupBy("uid", "k").max("count")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]
scala> maxCount.show
+---+-----+----------+
|uid| k|max(count)|
+---+-----+----------+
| a|pref3| 168|
| a|pref1| 168|
+---+-----+----------+
scala> val maxCount = inventory.groupBy("uid", "k").agg(max("count") as "max")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]
scala> maxCount.show
+---+-----+---+
|uid| k|max|
+---+-----+---+
| a|pref3|168|
| a|pref1|168|
+---+-----+---+
scala> maxCount.join(inventory, Seq("uid", "k")).where($"max" === $"count").show
+---+-----+---+---+-----+
|uid| k|max| v|count|
+---+-----+---+---+-----+
| a|pref3|168| h| 168|
| a|pref1|168| b| 168|
+---+-----+---+---+-----+
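The joined result still carries the helper max column next to count. A minimal follow-up sketch to match the expected max(count) shape exactly (reusing maxCount and inventory from above):

// Drop the helper column and rename count to the expected header.
maxCount.join(inventory, Seq("uid", "k"))
  .where($"max" === $"count")
  .drop("max")
  .withColumnRenamed("count", "max(count)")
  .show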
Here's the best solution I came up with so far:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

val w = Window.partitionBy("uid", "k").orderBy(col("count").desc)
df.withColumn("rank", dense_rank().over(w))
  .where("rank == 1")
  .select("uid", "k", "v", "count")
  .show
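Note that dense_rank keeps every row that ties for the top count within a group. If you need exactly one row per group even on ties, a sketch swapping in row_number (reusing the window w from above; which tied row wins is arbitrary unless you extend the ordering with a tie-breaker):

import org.apache.spark.sql.functions.row_number

// row_number assigns a unique rank within each (uid, k) partition,
// so only one row per group survives the filter even when counts tie.
df.withColumn("rn", row_number().over(w))
  .where("rn == 1")
  .select("uid", "k", "v", "count")
  .show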