Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to apply the describe function after grouping a PySpark DataFrame?

I want to find the cleanest way to apply the describe function to a grouped DataFrame (this question can also grow to apply any DF function to a grouped DF)

I tested grouped aggregate pandas UDF with no luck. There's always a way of doing it by passing each statistics inside the agg function but that's not the proper way.

If we have a sample dataframe:

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

The idea would be to do something similar to Pandas:

df.groupby("id").describe()

where the result would be:

                                                   v
    count mean     std    min   25%    50%  75%   max
id                              
1   2.0   1.5   0.707107  1.0   1.25   1.5  1.75  2.0
2   3.0   6.0   3.605551  3.0   4.00   5.0  7.50  10.0

Thanks.

like image 383
Favio Vázquez Avatar asked Jul 17 '19 20:07

Favio Vázquez


2 Answers

Try this:

df = (df
    .groupby("id")
    .agg(F.count('v').alias('count'),
         F.mean('v').alias('mean'),
         F.stddev('v').alias('std'),
         F.min('v').alias('min'),
         F.expr('percentile(v, array(0.25))')[0].alias('%25'),
         F.expr('percentile(v, array(0.5))')[0].alias('%50'),
         F.expr('percentile(v, array(0.75))')[0].alias('%75'),
         F.max('v').alias('max')))
df.show()

Output:

+---+-----+----+------------------+---+----+---+----+----+
| id|count|mean|               std|min| %25|%50| %75| max|
+---+-----+----+------------------+---+----+---+----+----+
|  1|    2| 1.5|0.7071067811865476|1.0|1.25|1.5|1.75| 2.0|
|  2|    3| 6.0| 3.605551275463989|3.0| 4.0|5.0| 7.5|10.0|
+---+-----+----+------------------+---+----+---+----+----+
like image 124
Ala Tarighati Avatar answered Oct 25 '22 06:10

Ala Tarighati


If you have a utility function module you could put something like this in it and call a one liner afterwards.

import pyspark.sql.functions as F

def groupby_apply_describe(df, groupby_col, stat_col):
    """From a grouby df object provide the stats
    of describe for each key in the groupby object.

    Parameters
    ----------
    df : spark dataframe groupby object
    col : column to compute statistics on
    
    """
    output = df.groupby(groupby_col).agg(
        F.count(stat_col).alias("count"),
        F.mean(stat_col).alias("mean"),
        F.stddev(stat_col).alias("std"),
        F.min(stat_col).alias("min"),
        F.expr(f"percentile({stat_col}, array(0.25))")[0].alias("%25"),
        F.expr(f"percentile({stat_col}, array(0.5))")[0].alias("%50"),
        F.expr(f"percentile({stat_col}, array(0.75))")[0].alias("%75"),
        F.max(stat_col).alias("max"),
    )
    print(output.orderBy(groupby_col).show())
    return output

In your case you would call groupby_apply_describe(df, 'id', 'v'). The output should match your requirements.

like image 42
Michael Hoffman Avatar answered Oct 25 '22 05:10

Michael Hoffman