
How to properly set binary flags in a Python polars dataframe

While implementing a binary flag column in Python polars v0.15.15, I came across some seemingly weird behavior. Given a DataFrame:

import polars as pl

df = pl.DataFrame({
        "col1": [0,1,2,3],
        "flag": [0,0,0,0]
    })

I set a flag by OR-ing the current flag value with the flag bit, e.g. 2:

df = df.with_column(
        pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
        .then(pl.col("flag") | 2) # set flag b0010
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 2    │
│ 1    ┆ 0    │
│ 2    ┆ 0    │
│ 3    ┆ 2    │
└──────┴──────┘

So far so good, however when adding another flag, I get something unexpected:

df = df.with_column(
        pl.when(pl.col("col1") > -1)  
        .then(pl.col("flag") | 4) # also set flag b0100
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 6    │
│ 1    ┆ 6    │ # <-- ?! 0 | 4 is 4, not 6
│ 2    ┆ 6    │ # <-- ?! 0 | 4 is 4, not 6
│ 3    ┆ 6    │
└──────┴──────┘

Why are all flags now 6? I'd expect [6, 4, 4, 6]
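
For reference, the same two steps written in plain Python give the values I'd expect. This is only a minimal sketch without polars, and the helper name set_flag is made up for illustration:

```python
col1 = [0, 1, 2, 3]
flags = [0, 0, 0, 0]

def set_flag(flags, mask, bit):
    """OR `bit` into every flag whose mask entry is True; leave the rest unchanged."""
    return [f | bit if m else f for f, m in zip(flags, mask)]

# step 1: set flag b0010 where col1 < 1 or col1 >= 3
flags = set_flag(flags, [(c < 1) or (c >= 3) for c in col1], 2)

# step 2: additionally set flag b0100 where col1 > -1 (i.e. every row)
flags = set_flag(flags, [c > -1 for c in col1], 4)

print(flags)  # [6, 4, 4, 6]
```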

Doing it the other way around (set flag 4, then flag 2), the result is as expected:

df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})
df = df.with_column(
        pl.when(pl.col("col1") > -1)  
        .then(pl.col("flag") | 4)
        .otherwise(pl.col("flag"))
    )
df = df.with_column(
        pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
        .then(pl.col("flag") | 2)
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 6    │
│ 1    ┆ 4    │
│ 2    ┆ 4    │
│ 3    ┆ 6    │
└──────┴──────┘

What's going on here? What am I missing?

FObersteiner asked Feb 21 '26 09:02



1 Answer

polars 0.17.14 update:

The issue is fixed; see also the closed issue on GitHub.


old answer:

One workaround is an apply (rather inefficient, since the lambda is called once per element):

import polars as pl

df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})

df = df.with_columns(
        pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
        .then(pl.col("flag").apply(lambda flag: flag | 2)) # set flag b0010
        .otherwise(pl.col("flag"))
    )
df = df.with_columns(
        pl.when(pl.col("col1") > -1)
        .then(pl.col("flag").apply(lambda flag: flag | 4)) # set/combine with flag b0100
        .otherwise(pl.col("flag"))
    )
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 0    ┆ 6    │
│ 1    ┆ 4    │
│ 2    ┆ 4    │
│ 3    ┆ 6    │
└──────┴──────┘

Or similarly np.bitwise_or (thanks @jqurious):

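import numpy as np

# condition_for_flag and flag_to_set are placeholders,
# e.g. pl.col("col1") > -1 and 4
df.with_columns(
        pl.when(condition_for_flag)
        .then(np.bitwise_or(pl.col("flag"), flag_to_set))
        .otherwise(pl.col("flag"))
    )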

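Or use np.where instead of polars' when-then-otherwise, then convert the result back to a Series:

df.with_columns(
        pl.Series(
            # np.where needs evaluated values, so condition_for_flag must be a
            # boolean array or Series here (e.g. df["col1"] > -1), not an expression
            np.where(condition_for_flag,
                     df["flag"].to_numpy() | flag_to_set,
                     df["flag"].to_numpy()
            )
        ).alias("flag")
    )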

Both np.bitwise_or and np.where seem to be more efficient than the apply. While apply most likely has linear time complexity, np.bitwise_or and np.where might scale differently with input size. If in doubt, benchmark with your specific (typical) input size.
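
A rough way to run such a test is sketched below. It does not use polars at all: a plain Python loop stands in for the per-element apply, which is an assumption made for illustration, and the function names and input sizes are made up. Timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

def python_loop(flags, bit):
    # one Python-level operation per element, similar in spirit to apply with a lambda
    return [f | bit for f in flags]

def numpy_or(flags, bit):
    # a single vectorized call over the whole array
    return np.bitwise_or(flags, bit)

for n in (1_000, 100_000):
    arr = np.zeros(n, dtype=np.int64)
    lst = arr.tolist()
    # both variants must agree before their timings are worth comparing
    assert python_loop(lst, 4) == numpy_or(arr, 4).tolist()
    t_loop = timeit.timeit(lambda: python_loop(lst, 4), number=20)
    t_np = timeit.timeit(lambda: numpy_or(arr, 4), number=20)
    print(f"n={n}: python loop {t_loop:.4f}s, np.bitwise_or {t_np:.4f}s")
```

To compare the actual polars variants, the two helper bodies can be swapped for the with_columns calls shown above.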

FObersteiner answered Feb 22 '26 23:02
