I want to find the cleanest way to apply the describe
function to a grouped DataFrame (this question can also grow to apply any DF function to a grouped DF)
I tested grouped aggregate pandas UDF with no luck. There's always a way of doing it by passing each statistics inside the agg
function but that's not the proper way.
If we have a sample dataframe:
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))
The idea would be to do something similar to Pandas:
df.groupby("id").describe()
where the result would be:
v
count mean std min 25% 50% 75% max
id
1 2.0 1.5 0.707107 1.0 1.25 1.5 1.75 2.0
2 3.0 6.0 3.605551 3.0 4.00 5.0 7.50 10.0
Thanks.
Try this:
df = (df
.groupby("id")
.agg(F.count('v').alias('count'),
F.mean('v').alias('mean'),
F.stddev('v').alias('std'),
F.min('v').alias('min'),
F.expr('percentile(v, array(0.25))')[0].alias('%25'),
F.expr('percentile(v, array(0.5))')[0].alias('%50'),
F.expr('percentile(v, array(0.75))')[0].alias('%75'),
F.max('v').alias('max')))
df.show()
Output:
+---+-----+----+------------------+---+----+---+----+----+
| id|count|mean| std|min| %25|%50| %75| max|
+---+-----+----+------------------+---+----+---+----+----+
| 1| 2| 1.5|0.7071067811865476|1.0|1.25|1.5|1.75| 2.0|
| 2| 3| 6.0| 3.605551275463989|3.0| 4.0|5.0| 7.5|10.0|
+---+-----+----+------------------+---+----+---+----+----+
If you have a utility function module you could put something like this in it and call a one liner afterwards.
import pyspark.sql.functions as F
def groupby_apply_describe(df, groupby_col, stat_col):
"""From a grouby df object provide the stats
of describe for each key in the groupby object.
Parameters
----------
df : spark dataframe groupby object
col : column to compute statistics on
"""
output = df.groupby(groupby_col).agg(
F.count(stat_col).alias("count"),
F.mean(stat_col).alias("mean"),
F.stddev(stat_col).alias("std"),
F.min(stat_col).alias("min"),
F.expr(f"percentile({stat_col}, array(0.25))")[0].alias("%25"),
F.expr(f"percentile({stat_col}, array(0.5))")[0].alias("%50"),
F.expr(f"percentile({stat_col}, array(0.75))")[0].alias("%75"),
F.max(stat_col).alias("max"),
)
print(output.orderBy(groupby_col).show())
return output
In your case you would call groupby_apply_describe(df, 'id', 'v')
. The output should match your requirements.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With