Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Polars solution to normalise groups by per-group reference value

I'm trying to use Polars to normalise the values of groups of entries by a single reference value per group.

Sample data:

import polars as pl

df = pl.from_repr("""
┌──────────┬─────────────────┬───────┐
│ group_id ┆ reference_state ┆ value │
│ ---      ┆ ---             ┆ ---   │
│ i64      ┆ str             ┆ i64   │
╞══════════╪═════════════════╪═══════╡
│ 1        ┆ ref             ┆ 5     │
│ 1        ┆ a               ┆ 3     │
│ 1        ┆ b               ┆ 1     │
│ 2        ┆ ref             ┆ 4     │
│ 2        ┆ a               ┆ 8     │
│ 2        ┆ b               ┆ 2     │
└──────────┴─────────────────┴───────┘
""")

I'm trying to generate the column normalised which contains value divided by the per-group ref reference state value.

This is straightforward in Pandas:

df = df.to_pandas()

for (i, x) in df.groupby("group_id"):
    ref_val = x.loc[x["reference_state"] == "ref"]["value"]
    df.loc[df["group_id"] == i, "normalised"] = x["value"] / ref_val.to_list()[0]
    
pl.from_pandas(df)
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ ---      ┆ ---             ┆ ---   ┆ ---        │
│ i64      ┆ str             ┆ i64   ┆ f64        │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1        ┆ ref             ┆ 5     ┆ 1.0        │
│ 1        ┆ a               ┆ 3     ┆ 0.6        │
│ 1        ┆ b               ┆ 1     ┆ 0.2        │
│ 2        ┆ ref             ┆ 4     ┆ 1.0        │
│ 2        ┆ a               ┆ 8     ┆ 2.0        │
│ 2        ┆ b               ┆ 2     ┆ 0.5        │
└──────────┴─────────────────┴───────┴────────────┘

Is there a way to do this in Polars?

Thanks in advance!

like image 943
Tuatar Avatar asked Mar 18 '26 04:03

Tuatar


1 Answers

You can use a window function to make an expression operate on different groups via:

.over("group_id")

and then you can write the logic which divides by the values if equal to "ref" with:

pl.col("value") / pl.col("value").filter(pl.col("reference_state") == "ref").first()

Putting it all together:

df = pl.DataFrame({
    "group_id": [1, 1, 1, 2, 2, 2],
    "reference_state": ["ref", "a", "b", "ref", "a", "b"],
    "value": [5, 3, 1, 4, 8, 2],
})

(df.with_columns(
    (
        pl.col("value") / 
        pl.col("value").filter(pl.col("reference_state") == "ref").first()
    ).over("group_id").alias("normalised")
    
))
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ ---      ┆ ---             ┆ ---   ┆ ---        │
│ i64      ┆ str             ┆ i64   ┆ f64        │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1        ┆ ref             ┆ 5     ┆ 1.0        │
│ 1        ┆ a               ┆ 3     ┆ 0.6        │
│ 1        ┆ b               ┆ 1     ┆ 0.2        │
│ 2        ┆ ref             ┆ 4     ┆ 1.0        │
│ 2        ┆ a               ┆ 8     ┆ 2.0        │
│ 2        ┆ b               ┆ 2     ┆ 0.5        │
└──────────┴─────────────────┴───────┴────────────┘


like image 190
ritchie46 Avatar answered Mar 20 '26 16:03

ritchie46