Spark aggregations where output columns are functions and rows are columns

I want to compute a bunch of different agg functions on different columns in a dataframe.

I know I can do something like this, but the output is all in one row:

df.agg(max("cola"), min("cola"), max("colb"), min("colb"))

Let's say I will be performing 100 different aggregations on 10 different columns.

I want the output dataframe to look like this:

      | Min  | Max  | AnotherAggFunction1 | AnotherAggFunction2 | ...etc.
cola  | 1    | 10   | ...
colb  | 2    | NULL | ...
colc  | 5    | 20   | ...
cold  | NULL | 42   | ...
...

Each row corresponds to a column I am aggregating, and each output column corresponds to an aggregation function. Some cells will be null if I don't compute a given aggregation for that column (for example, colb's max).

How can I accomplish this?

asked Dec 12 '25 by asojidaiod
1 Answer

You can create a map column, say Metrics, where the keys are column names and the values are a struct of aggregations (max, min, avg, ...). I am using the map_from_entries function (available from Spark 2.4+) to create the map column. Then simply explode the map to get the structure you want.

Here is an example you can adapt to your requirements:

from pyspark.sql.functions import array, explode, lit, map_from_entries, max, min, struct

df = spark.createDataFrame(
    [("A", 1, 2), ("B", 2, 4), ("C", 5, 6), ("D", 6, 8)],
    ['cola', 'colb', 'colc']
)

# Build one map entry per column: the key is the column name, the value
# is a struct holding that column's aggregations.
agg = map_from_entries(array(
    *[
        struct(lit(c),
               struct(max(c).alias("Max"), min(c).alias("Min")))
        for c in df.columns
    ])).alias("Metrics")

# Explode the map so each column becomes a row, then flatten the struct
# of aggregations into one output column per function.
df.agg(agg).select(explode("Metrics").alias("col", "Metrics")) \
    .select("col", "Metrics.*") \
    .show()

#+----+---+---+
#|col |Max|Min|
#+----+---+---+
#|cola|D  |A  |
#|colb|6  |1  |
#|colc|8  |2  |
#+----+---+---+
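
If, as in your desired output, only some aggregations should be computed per column, you can fill the missing entries with NULLs so every map value keeps the same struct schema. Here is a minimal sketch of that idea, assuming a hypothetical wanted dict (not part of the original example) that lists which aggregations to compute for each column:

from pyspark.sql.functions import array, explode, lit, map_from_entries, max, min, struct

# Hypothetical spec: which aggregations to compute per column;
# anything not listed here comes out as NULL.
wanted = {"cola": ["Min"], "colb": ["Max", "Min"], "colc": ["Max"]}
funcs = {"Max": max, "Min": min}

def agg_or_null(name, c):
    # Apply the aggregation when requested for this column, otherwise emit NULL.
    return funcs[name](c) if name in wanted.get(c, []) else lit(None)

agg = map_from_entries(array(
    *[
        struct(lit(c),
               struct(*[agg_or_null(n, c).alias(n) for n in funcs]))
        for c in df.columns
    ])).alias("Metrics")

df.agg(agg).select(explode("Metrics").alias("col", "Metrics")) \
    .select("col", "Metrics.*") \
    .show()

Note that all map values must share one type, so Spark casts the aggregated values to a common type (string here, since cola is a string column), and the NULL placeholders cast implicitly.
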
answered Dec 14 '25 by blackbishop