I have the following dataframe:
df = pl.DataFrame({
"Column A": [2, 3, 1, 4, 1, 3, 3, 2, 1, 0],
"Column B": [
"Life", None, None, None, "Death", None,
"Life", None, None, "Death"
]
})
shape: (10, 2)
┌──────────┬──────────┐
│ Column A ┆ Column B │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════════╪══════════╡
│ 2 ┆ Life │
│ 3 ┆ null │
│ 1 ┆ null │
│ 4 ┆ null │
│ 1 ┆ Death │
│ 3 ┆ null │
│ 3 ┆ Life │
│ 2 ┆ null │
│ 1 ┆ null │
│ 0 ┆ Death │
└──────────┴──────────┘
I want to create a new column, let's call it Column C. For each row where Column B is 'Life', Column C should have the maximum value in the range of values in Column A from that row until the row where Column B is 'Death'. In cases where Column B is not 'Life', Column C should be set to 'None'
The end result should look like this:
shape: (10, 3)
┌──────────┬──────────┬──────────┐
│ Column A ┆ Column B ┆ Column C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞══════════╪══════════╪══════════╡
│ 2 ┆ Life ┆ 4.0 │
│ 3 ┆ null ┆ null │
│ 1 ┆ null ┆ null │
│ 4 ┆ null ┆ null │
│ 1 ┆ Death ┆ null │
│ 3 ┆ null ┆ null │
│ 3 ┆ Life ┆ 3.0 │
│ 2 ┆ null ┆ null │
│ 1 ┆ null ┆ null │
│ 0 ┆ Death ┆ null │
└──────────┴──────────┴──────────┘
How can I achieve this using Polars in Python? Any help or suggestions would be appreciated!
I think the general idea is to assign "group ids" to each "range".
A common approach for this is to use cumulative sum along with forward filling.
(
df.with_columns(
start = (pl.col("Column B") == "Life").cum_sum().forward_fill(),
end = (pl.col("Column B") == "Death").cum_sum().forward_fill()
)
.with_columns(
group_id_1 = pl.col("start") + pl.col("end")
)
.with_columns(
group_id_2 =
pl.when(pl.col("Column B") == "Death")
.then(pl.col("group_id_1").shift())
.otherwise(pl.col("group_id_1"))
)
)
shape: (10, 6)
┌──────────┬──────────┬───────┬─────┬────────────┬────────────┐
│ Column A ┆ Column B ┆ start ┆ end ┆ group_id_1 ┆ group_id_2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════╪══════════╪═══════╪═════╪════════════╪════════════╡
│ 2 ┆ Life ┆ 1 ┆ 0 ┆ 1 ┆ 1 │
│ 3 ┆ null ┆ 1 ┆ 0 ┆ 1 ┆ 1 │
│ 1 ┆ null ┆ 1 ┆ 0 ┆ 1 ┆ 1 │
│ 4 ┆ null ┆ 1 ┆ 0 ┆ 1 ┆ 1 │
│ 1 ┆ Death ┆ 1 ┆ 1 ┆ 2 ┆ 1 │ # 2 -> 1
│ 3 ┆ null ┆ 1 ┆ 1 ┆ 2 ┆ 2 │
│ 3 ┆ Life ┆ 2 ┆ 1 ┆ 3 ┆ 3 │
│ 2 ┆ null ┆ 2 ┆ 1 ┆ 3 ┆ 3 │
│ 1 ┆ null ┆ 2 ┆ 1 ┆ 3 ┆ 3 │
│ 0 ┆ Death ┆ 2 ┆ 2 ┆ 4 ┆ 3 │ # 4 -> 3
└──────────┴──────────┴───────┴─────┴────────────┴────────────┘
group_id_1
gets most of the way there apart from the Death rows which need to be shifted to produce group_id_2
As it is sufficiently complex you may want to use variables and/or a function to build the final expression:
start = pl.col("Column B") == "Life"
end = pl.col("Column B") == "Death"
group_id = (start.cum_sum() + end.cum_sum()).forward_fill() # id_1
group_id = ( # id_2
pl.when(end)
.then(group_id.shift())
.otherwise(group_id)
)
# Insert the max over each group into each Life row
df.with_columns(
pl.when(start)
.then(pl.col("Column A").max().over(group_id))
.alias("Column C")
)
shape: (10, 3)
┌──────────┬──────────┬──────────┐
│ Column A ┆ Column B ┆ Column C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞══════════╪══════════╪══════════╡
│ 2 ┆ Life ┆ 4 │
│ 3 ┆ null ┆ null │
│ 1 ┆ null ┆ null │
│ 4 ┆ null ┆ null │
│ 1 ┆ Death ┆ null │
│ 3 ┆ null ┆ null │
│ 3 ┆ Life ┆ 3 │
│ 2 ┆ null ┆ null │
│ 1 ┆ null ┆ null │
│ 0 ┆ Death ┆ null │
└──────────┴──────────┴──────────┘
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With