Normally all rows in a group are passed to an aggregate function. I would like to filter rows with a condition so that only some rows within a group reach the aggregate function. Such an operation is possible in PostgreSQL (e.g. with a FILTER clause on aggregates), and I would like to do the same thing with a Spark SQL DataFrame (Spark 2.0.0).
The code could probably look like this:
val df = ... // some data frame
df.groupBy("A").agg(
max("B").where("B").less(10), // there is no such method as `where` :(
max("C").where("C").less(5)
)
So for a data frame like this:
|  A|  B|  C|
|  1| 14|  4|
|  1|  9|  3|
|  2|  5|  6|
The result would be:
|  A|max(B)|max(C)|
|  1|     9|     4|
|  2|     5|  null|
Is it possible with Spark SQL?
Note that, in general, any aggregate function other than max could be used, and there could be multiple aggregates over the same column with arbitrary filtering conditions.
The relevant API is agg(Column expr, Column... exprs), which computes aggregates by specifying a series of aggregate columns. The per-row condition can be pushed inside each aggregate expression with when:
import org.apache.spark.sql.functions.{max, when}
import spark.implicits._ // spark is the active SparkSession (e.g. in spark-shell)

val df = Seq(
  (1, 14, 4),
  (1, 9, 3),
  (2, 5, 6)
).toDF("a", "b", "c")

// `when` without `otherwise` yields null for non-matching rows,
// and `max` ignores nulls, so only the matching rows are aggregated.
val aggregatedDF = df.groupBy("a")
  .agg(
    max(when($"b" < 10, $"b")).as("MaxB"),
    max(when($"c" < 5, $"c")).as("MaxC")
  )

aggregatedDF.show
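With the sample data above, this should print something along the following lines (row order within the result is not guaranteed):

+---+----+----+
|  a|MaxB|MaxC|
+---+----+----+
|  1|   9|   4|
|  2|   5|null|
+---+----+----+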
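As a sketch of the same technique written with SQL expressions (expr and CASE WHEN are standard Spark SQL; the aliases MaxB, MinB, MaxC and the extra min aggregate are illustrative only), the approach extends naturally to multiple aggregates over the same column with different conditions:

import org.apache.spark.sql.functions.expr

// CASE WHEN without ELSE also yields null for non-matching rows,
// and aggregate functions skip nulls, so each aggregate sees only its own subset.
val viaExpr = df.groupBy("a").agg(
  expr("max(CASE WHEN b < 10 THEN b END) AS MaxB"),
  expr("min(CASE WHEN b < 10 THEN b END) AS MinB"), // second aggregate over the same column
  expr("max(CASE WHEN c < 5 THEN c END) AS MaxC")
)

viaExpr.show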