Find max value in a range of a given column in Polars?

Question

I have the following dataframe:

df = pl.DataFrame({
    "Column A": [2, 3, 1, 4, 1, 3, 3, 2, 1, 0],
    "Column B": [
        "Life", None, None, None, "Death", None, 
        "Life", None, None, "Death"
    ]
})

shape: (10, 2)
┌──────────┬──────────┐
│ Column A ┆ Column B │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 2        ┆ Life     │
│ 3        ┆ null     │
│ 1        ┆ null     │
│ 4        ┆ null     │
│ 1        ┆ Death    │
│ 3        ┆ null     │
│ 3        ┆ Life     │
│ 2        ┆ null     │
│ 1        ┆ null     │
│ 0        ┆ Death    │
└──────────┴──────────┘

I want to create a new column, let's call it Column C. For each row where Column B is 'Life', Column C should have the maximum value in the range of values in Column A from that row until the row where Column B is 'Death'. In cases where Column B is not 'Life', Column C should be set to 'None'

The end result should look like this:

shape: (10, 3)
┌──────────┬──────────┬──────────┐
│ Column A ┆ Column B ┆ Column C │
│ ---      ┆ ---      ┆ ---      │
│ i64      ┆ str      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ 2        ┆ Life     ┆ 4.0      │
│ 3        ┆ null     ┆ null     │
│ 1        ┆ null     ┆ null     │
│ 4        ┆ null     ┆ null     │
│ 1        ┆ Death    ┆ null     │
│ 3        ┆ null     ┆ null     │
│ 3        ┆ Life     ┆ 3.0      │
│ 2        ┆ null     ┆ null     │
│ 1        ┆ null     ┆ null     │
│ 0        ┆ Death    ┆ null     │
└──────────┴──────────┴──────────┘

How can I achieve this using Polars in Python? Any help or suggestions would be appreciated!

jqurious · Accepted Answer

I think the general idea is to assign "group ids" to each "range".

A common approach for this is to use cumulative sum along with forward filling.

(
    df.with_columns(
        start = (pl.col("Column B") == "Life").cum_sum().forward_fill(),
        end = (pl.col("Column B") == "Death").cum_sum().forward_fill()
    )
    .with_columns(
        group_id_1 = pl.col("start") + pl.col("end")
    )
    .with_columns(
        group_id_2 = 
            pl.when(pl.col("Column B") == "Death")
              .then(pl.col("group_id_1").shift())
              .otherwise(pl.col("group_id_1"))
    )
)

shape: (10, 6)
┌──────────┬──────────┬───────┬─────┬────────────┬────────────┐
│ Column A ┆ Column B ┆ start ┆ end ┆ group_id_1 ┆ group_id_2 │
│ ---      ┆ ---      ┆ ---   ┆ --- ┆ ---        ┆ ---        │
│ i64      ┆ str      ┆ u32   ┆ u32 ┆ u32        ┆ u32        │
╞══════════╪══════════╪═══════╪═════╪════════════╪════════════╡
│ 2        ┆ Life     ┆ 1     ┆ 0   ┆ 1          ┆ 1          │
│ 3        ┆ null     ┆ 1     ┆ 0   ┆ 1          ┆ 1          │
│ 1        ┆ null     ┆ 1     ┆ 0   ┆ 1          ┆ 1          │
│ 4        ┆ null     ┆ 1     ┆ 0   ┆ 1          ┆ 1          │
│ 1        ┆ Death    ┆ 1     ┆ 1   ┆ 2          ┆ 1          │ # 2 -> 1
│ 3        ┆ null     ┆ 1     ┆ 1   ┆ 2          ┆ 2          │
│ 3        ┆ Life     ┆ 2     ┆ 1   ┆ 3          ┆ 3          │
│ 2        ┆ null     ┆ 2     ┆ 1   ┆ 3          ┆ 3          │
│ 1        ┆ null     ┆ 2     ┆ 1   ┆ 3          ┆ 3          │
│ 0        ┆ Death    ┆ 2     ┆ 2   ┆ 4          ┆ 3          │ # 4 -> 3
└──────────┴──────────┴───────┴─────┴────────────┴────────────┘

group_id_1 gets most of the way there apart from the Death rows which need to be shifted to produce group_id_2

As it is sufficiently complex you may want to use variables and/or a function to build the final expression:

start = pl.col("Column B") == "Life"
end   = pl.col("Column B") == "Death"

group_id = (start.cum_sum() + end.cum_sum()).forward_fill() # id_1
group_id = (                                                # id_2
   pl.when(end)
     .then(group_id.shift())
     .otherwise(group_id)
)

# Insert the max over each group into each Life row
df.with_columns(
   pl.when(start) 
     .then(pl.col("Column A").max().over(group_id))
     .alias("Column C")
)

shape: (10, 3)
┌──────────┬──────────┬──────────┐
│ Column A ┆ Column B ┆ Column C │
│ ---      ┆ ---      ┆ ---      │
│ i64      ┆ str      ┆ i64      │
╞══════════╪══════════╪══════════╡
│ 2        ┆ Life     ┆ 4        │
│ 3        ┆ null     ┆ null     │
│ 1        ┆ null     ┆ null     │
│ 4        ┆ null     ┆ null     │
│ 1        ┆ Death    ┆ null     │
│ 3        ┆ null     ┆ null     │
│ 3        ┆ Life     ┆ 3        │
│ 2        ┆ null     ┆ null     │
│ 1        ┆ null     ┆ null     │
│ 0        ┆ Death    ┆ null     │
└──────────┴──────────┴──────────┘

Find max value in a range of a given column in Polars?

Tags:

python

python-polars

Levi Santos

1 Answers

jqurious

Recent Activity

Donate For Us

Find max value in a range of a given column in Polars?

Tags:

python

python-polars

Levi Santos

1 Answers

jqurious

Related questions

Recent Activity

Donate For Us