I have a polars dataframe that contain some ID, actions, and values :
Example Dataframe:
data = {
"ID" : [1, 1, 2,2,3,3],
"Action" : ["A", "A", "B", "B", "A", "A"],
"Where" : ["Office", "Home", "Home", "Office", "Home", "Home"],
"Value" : [1, 2, 3, 4, 5, 6]
}
df = pl.DataFrame(data)
I want to select for each ID and action the biggest value, so i know where he rather do the action.
I'm taking the following approach :
(
df
.select(
pl.col("ID"),
pl.col("Action"),
pl.col("Where"),
TOP = pl.col("Value").max().over(["ID", "Action"]))
)
After that , i sorted the values and keep the unique values (The first one) to maintain the desired info, however the input its incorrect :
(
df
.select(
pl.col("ID"),
pl.col("Action"),
pl.col("Where"),
TOP = pl.col("Value").max().over(["ID", "Action"]))
.sort(
pl.col("*"), descending =True
)
.unique(
subset = ["ID", "Action"],
maintain_order = True,
keep = "first"
)
)
Current Output :
shape: (3, 4)
┌─────┬────────┬────────┬─────┐
│ ID ┆ Action ┆ Where ┆ TOP │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪════════╪════════╪═════╡
│ 3 ┆ A ┆ Home ┆ 6 │
│ 2 ┆ B ┆ Office ┆ 4 │
│ 1 ┆ A ┆ Office ┆ 2 │
└─────┴────────┴────────┴─────┘
Expected Output:
shape: (3, 4)
┌─────┬────────┬────────┬─────┐
│ ID ┆ Action ┆ Where ┆ TOP │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪════════╪════════╪═════╡
│ 3 ┆ A ┆ Home ┆ 6 │
│ 2 ┆ B ┆ Office ┆ 4 │
│ 1 ┆ A ┆ Home ┆ 2 │
└─────┴────────┴────────┴─────┘
Also, i think this approach its not the optimal way
The over and unique could be combined into a group_by
.arg_max()
can give you the index of the max.get()
will extract the corresponding values at that index(df.group_by("ID", "Action")
.agg(
pl.all().get(pl.col("Value").arg_max())
)
)
shape: (3, 4)
┌─────┬────────┬────────┬───────┐
│ ID ┆ Action ┆ Where ┆ Value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪════════╪════════╪═══════╡
│ 1 ┆ A ┆ Home ┆ 2 │
│ 2 ┆ B ┆ Office ┆ 4 │
│ 3 ┆ A ┆ Home ┆ 6 │
└─────┴────────┴────────┴───────┘
You can use the following builtin in function in polars to do this easily.
with_columns
: this provides you with with the information based on col(value) over both ID
and Action
taken.
sort
: this just sorts in descending order
unique
: this allows you to drop all duplicates
import polars as pl
data = {
"ID": [1, 1, 2, 2, 3, 3],
"Action": ["A", "A", "B", "B", "A", "A"],
"Where": ["Office", "Home", "Home", "Office", "Home", "Home"],
"Value": [1, 2, 3, 4, 5, 6]
}
df = pl.DataFrame(data)
result = (
df
.with_columns(
TOP=pl.col("Value").max().over(["ID", "Action"])
)
.sort(by="TOP", descending=True)
.unique(subset=["ID", "Action"], maintain_order=True, keep="first")
)
print(result)
shape: (3, 5)
┌─────┬────────┬────────┬───────┬─────┐
│ ID ┆ Action ┆ Where ┆ Value ┆ TOP │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞═════╪════════╪════════╪═══════╪═════╡
│ 3 ┆ A ┆ Home ┆ 5 ┆ 6 │
│ 2 ┆ B ┆ Home ┆ 3 ┆ 4 │
│ 1 ┆ A ┆ Office ┆ 1 ┆ 2 │
└─────┴────────┴────────┴───────┴─────┘
Alternatively you can go with groupby
and filter
which eliminates the use of unique even tho they sound the same
result = (
df.group_by(["ID", "Action"]).agg(
[
pl.col("Value").max().alias("TOP"),
pl.col("Where")
.filter(pl.col("Value") == pl.col("Value").max())
.first()
.alias("Where"),
]
)
.sort("TOP", descending=True)
)
another result:
shape: (3, 4)
┌─────┬────────┬─────┬────────┐
│ ID ┆ Action ┆ TOP ┆ Where │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞═════╪════════╪═════╪════════╡
│ 3 ┆ A ┆ 6 ┆ Home │
│ 2 ┆ B ┆ 4 ┆ Office │
│ 1 ┆ A ┆ 2 ┆ Home │
└─────┴────────┴─────┴────────┘
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With