Polars Groupby Describe Extension

df is a demo Polars DataFrame:

df = pl.DataFrame(
    {
        "groups": ["A", "A", "A", "B", "B", "B"],
        "values": [1, 2, 3, 4, 5, 6],
    }
)

The current group_by.agg() approach is a bit inconvenient for creating descriptive statistics:

print(
    df.group_by("groups").agg(
        pl.len().alias("count"),
        pl.col("values").mean().alias("mean"),
        pl.col("values").std().alias("std"),
        pl.col("values").min().alias("min"),
        pl.col("values").quantile(0.25).alias("25%"),
        pl.col("values").quantile(0.5).alias("50%"),
        pl.col("values").quantile(0.75).alias("75%"),
        pl.col("values").max().alias("max"),
        pl.col("values").skew().alias("skew"),
        pl.col("values").kurtosis().alias("kurtosis"),
    )
)

out:
shape: (2, 11)
┌────────┬───────┬──────┬─────┬───┬─────┬─────┬──────┬──────────┐
│ groups ┆ count ┆ mean ┆ std ┆ … ┆ 75% ┆ max ┆ skew ┆ kurtosis │
│ ---    ┆ ---   ┆ ---  ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ ---      │
│ str    ┆ u32   ┆ f64  ┆ f64 ┆   ┆ f64 ┆ i64 ┆ f64  ┆ f64      │
╞════════╪═══════╪══════╪═════╪═══╪═════╪═════╪══════╪══════════╡
│ B      ┆ 3     ┆ 5.0  ┆ 1.0 ┆ … ┆ 6.0 ┆ 6   ┆ 0.0  ┆ -1.5     │
│ A      ┆ 3     ┆ 2.0  ┆ 1.0 ┆ … ┆ 3.0 ┆ 3   ┆ 0.0  ┆ -1.5     │
└────────┴───────┴──────┴─────┴───┴─────┴─────┴──────┴──────────┘

I want to write a customized group_by extension module that allows me to achieve the same results by calling:

df.describe(by="groups", percentiles=[xxx], skew=True, kurt=True)

or

df.group_by("groups").describe(percentiles=....)
Kevin Li asked Oct 16 '25 00:10


1 Answer

Calling the accessor below produces the same output as the group_by.agg() call in the question:

import polars as pl


class DescribeAccessor:
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def __call__(
            self,
            by: str,
            percentiles: tuple = (0.25, 0.5, 0.75),  # immutable default; any sequence of floats works
            skew: bool = True,
            kurt: bool = True,
    ) -> pl.DataFrame:
        percentile_exprs = [
            # round() rather than int(): 0.29 * 100 is 28.999..., which int() truncates to 28
            pl.col("values").quantile(p).alias(f"{round(p * 100)}%")
            for p in percentiles
        ]

        aggs = [
            pl.len().alias("count"),
            pl.col("values").mean().alias("mean"),
            pl.col("values").std().alias("std"),
            pl.col("values").min().alias("min"),
            *percentile_exprs,
            pl.col("values").max().alias("max"),
        ]

        if skew:
            aggs.append(pl.col("values").skew().alias("skew"))

        if kurt:
            aggs.append(pl.col("values").kurtosis().alias("kurtosis"))

        return self._df.group_by(by).agg(aggs)


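# Note: this replaces the built-in DataFrame.describe(), and the "values" column is hard-coded above.
pl.DataFrame.describe = property(lambda self: DescribeAccessor(self))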

df = pl.DataFrame(
    {
        "groups": ["A", "A", "A", "B", "B", "B"],
        "values": [1, 2, 3, 4, 5, 6],
    }
)

print(df.describe(by="groups", percentiles=[0.25, 0.5, 0.75], skew=True, kurt=True))
meshkati answered Oct 17 '25 16:10


