I want to compute a bunch of different agg functions on different columns in a dataframe.
I know I can do something like this, but the output is all one row.
df.agg(max("cola"), min("cola"), max("colb"), min("colb"))
Let's say I will be performing 100 different aggregations on 10 different columns.
I want the output dataframe to be like this:
|    |Min |Max |AnotherAggFunction1|AnotherAggFunction2|...
|cola|1   |10  |...
|colb|2   |NULL|...
|colc|5   |20  |...
|cold|NULL|42  |...
...
Here each row is a column I am aggregating and each output column is an aggregation function. Some cells will be NULL if I don't calculate that aggregation for that column, for example colb's max.
How can I accomplish this?
You can create a Map column, say Metrics, where the keys are column names and the values are structs of aggregations (max, min, avg, ...). I am using the map_from_entries function (available from Spark 2.4+) to build the map column, and then simply exploding the map to get the structure you want.
Here is an example you can adapt for your requirement:
from pyspark.sql.functions import array, explode, lit, map_from_entries, max, min, struct

df = spark.createDataFrame([("A", 1, 2), ("B", 2, 4), ("C", 5, 6), ("D", 6, 8)],
                           ['cola', 'colb', 'colc'])

# One map entry per column: key = the column name, value = a struct
# holding all the aggregations computed for that column
agg = map_from_entries(array(*[
    struct(lit(c), struct(max(c).alias("Max"), min(c).alias("Min")))
    for c in df.columns
])).alias("Metrics")

# Aggregate to a single row, then explode the map: one row per column,
# one output column per aggregation. Because cola holds strings while
# colb/colc hold integers, Spark coerces all map values to a common
# (string) type, which is why Max/Min come out as strings below.
df.agg(agg).select(explode("Metrics").alias("col", "Metrics")) \
  .select("col", "Metrics.*") \
  .show()
#+----+---+---+
#|col |Max|Min|
#+----+---+---+
#|cola|D |A |
#|colb|6 |1 |
#|colc|8 |2 |
#+----+---+---+
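To get the NULL cells from your desired output (when an aggregation is computed only for some columns), one option is to fill the missing slots with lit(None) so every map value keeps the same struct type. Here is a sketch adapted from the example above; agg_spec and all_aggs are hypothetical names for illustration, and I restrict it to the numeric columns so no string coercion kicks in:

from pyspark.sql.functions import array, explode, lit, map_from_entries, max, min, struct

# Hypothetical spec: which aggregations to compute for each column.
# Anything not listed comes out as NULL, as in your desired output.
agg_spec = {"colb": {"Max", "Min"}, "colc": {"Min"}}  # no Max for colc
all_aggs = {"Max": max, "Min": min}

# Fill missing aggregations with a typed NULL so every map value
# has the same struct<Max: long, Min: long> type
entries = [
    struct(lit(c), struct(*[
        (fn(c) if name in wanted else lit(None).cast("long")).alias(name)
        for name, fn in all_aggs.items()
    ]))
    for c, wanted in agg_spec.items()
]

# Using the sample df above: colb -> Max=6, Min=1; colc -> Max=NULL, Min=2
df.agg(map_from_entries(array(*entries)).alias("Metrics")) \
  .select(explode("Metrics").alias("col", "Metrics")) \
  .select("col", "Metrics.*") \
  .show()

The same pattern should scale to your 100 aggregations over 10 columns: extend all_aggs with the extra functions and list in agg_spec only the combinations you actually want computed.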