I am trying to use polars to do a window aggregate over one value, but map it back to another.
For example, if i wanted to get the name of the max value in a group, instead of (or in combination to) just the max value.
assuming an input of something like this.
df = pl.from_repr("""
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ foo ┆ 1.0 │
│ a. ┆ bar ┆ 2.0 │
│ b. ┆ baz ┆ 1.5 │
│ b. ┆ boo ┆ -1.0 │
└───────┴──────┴───────┘
""")
# 'max_by' is not a real method, just using it to express what i'm trying to achieve.
df.select(pl.col('label'), pl.col('name').max_by('value').over('label'))
i want an output like this
shape: (2, 2)
┌───────┬──────┐
│ label ┆ name │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪══════╡
│ a. ┆ bar │
│ b. ┆ baz │
└───────┴──────┘
ideally with the value too. But i know i can easily add that in via pl.col('value').max().over('label').
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
You were close. There is a sort_by expression that can be used.
df.group_by('label').agg(pl.all().sort_by('value').last())
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
If you need a windowed version of this:
df.with_columns(
pl.col('name','value').sort_by('value').last().over('label').name.suffix("_max")
)
shape: (4, 5)
┌───────┬──────┬───────┬──────────┬───────────┐
│ label ┆ name ┆ value ┆ name_max ┆ value_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str ┆ f64 │
╞═══════╪══════╪═══════╪══════════╪═══════════╡
│ a. ┆ foo ┆ 1.0 ┆ bar ┆ 2.0 │
│ a. ┆ bar ┆ 2.0 ┆ bar ┆ 2.0 │
│ b. ┆ baz ┆ 1.5 ┆ baz ┆ 1.5 │
│ b. ┆ boo ┆ -1.0 ┆ baz ┆ 1.5 │
└───────┴──────┴───────┴──────────┴───────────┘
You can do it using DataFrame.filter
import polars as pl
df = pl.DataFrame({
'label': ['a.', 'a.', 'b.', 'b.'],
'name': ['foo', 'bar', 'baz', 'boo'],
'value': [1, 2, 1.5, -1]
})
res = df.filter(
pl.col('value') == pl.max('value').over('label')
)
Output:
>>> res
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With