When implementing a binary flag column in Python polars v0.15.15, I came across some seemingly weird behavior. Given this DataFrame:
import polars as pl
df = pl.DataFrame({
    "col1": [0, 1, 2, 3],
    "flag": [0, 0, 0, 0]
})
I set a flag by OR-ing it into the current flag value, e.g. 2:
df = df.with_column(
    pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
    .then(pl.col("flag") | 2)  # set flag b0010
    .otherwise(pl.col("flag"))
)
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 2 │
│ 1 ┆ 0 │
│ 2 ┆ 0 │
│ 3 ┆ 2 │
└──────┴──────┘
So far so good. However, when adding another flag, I get something unexpected:
df = df.with_column(
    pl.when(pl.col("col1") > -1)
    .then(pl.col("flag") | 4)  # also set flag b0100
    .otherwise(pl.col("flag"))
)
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 6 │
│ 1 ┆ 6 │ # <-- ?! 0 | 4 is 4, not 6
│ 2 ┆ 6 │ # <-- ?! 0 | 4 is 4, not 6
│ 3 ┆ 6 │
└──────┴──────┘
Why are all flags now 6? I'd expect [6, 4, 4, 6]
Doing it the other way around (set flag 4, then flag 2), the result is as expected:
df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})
df = df.with_column(
    pl.when(pl.col("col1") > -1)
    .then(pl.col("flag") | 4)
    .otherwise(pl.col("flag"))
)
df = df.with_column(
    pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
    .then(pl.col("flag") | 2)
    .otherwise(pl.col("flag"))
)
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 6 │
│ 1 ┆ 4 │
│ 2 ┆ 4 │
│ 3 ┆ 6 │
└──────┴──────┘
What's going on here, what am I missing?
The issue has been fixed in polars; see also the closed issue on GitHub.
For affected versions, one workaround is an apply (rather inefficient):
import polars as pl
df = pl.DataFrame({"col1": [0,1,2,3], "flag": [0,0,0,0]})
df = df.with_columns(
    pl.when((pl.col("col1") < 1) | (pl.col("col1") >= 3))
    .then(pl.col("flag").apply(lambda flag: flag | 2))  # set flag b0010
    .otherwise(pl.col("flag"))
)
df = df.with_columns(
    pl.when(pl.col("col1") > -1)
    .then(pl.col("flag").apply(lambda flag: flag | 4))  # set/combine with flag b0100
    .otherwise(pl.col("flag"))
)
print(df)
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 6 │
│ 1 ┆ 4 │
│ 2 ┆ 4 │
│ 3 ┆ 6 │
└──────┴──────┘
Or similarly np.bitwise_or (thanks @jqurious), with numpy imported as np and condition_for_flag / flag_to_set as placeholders:
df.with_columns(
    pl.when(condition_for_flag)
    .then(np.bitwise_or(pl.col("flag"), flag_to_set))
    .otherwise(pl.col("flag"))
)
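Or use np.where instead of polars' when-then-otherwise, then convert the result back to a Series (here condition_for_flag must be a boolean array or Series rather than an expression):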
df.with_columns(
    pl.Series(
        np.where(
            condition_for_flag,
            df["flag"].to_numpy() | flag_to_set,
            df["flag"],
        )
    ).alias("flag")
)
Both np.bitwise_or and np.where seem to be more efficient than the apply. While apply most likely scales linearly with input size, np.bitwise_or and np.where might perform differently depending on input size. If in doubt, benchmark with your specific (typical) input size.
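If you want to measure this yourself, a rough timing harness could look like the following. This is only a sketch: the per-element Python loop is a stand-in for the cost profile of apply, not polars' actual apply, and the array size n and the function names are arbitrary choices for illustration.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # assumed "typical" input size; adjust to your data
col1 = rng.integers(0, 4, size=n)
flags = np.zeros(n, dtype=np.int64)
mask = (col1 < 1) | (col1 >= 3)


def with_where():
    # vectorized: OR the bit in where the mask is True
    return np.where(mask, flags | 2, flags)


def with_python_loop():
    # element-by-element, roughly what a per-row Python callback costs
    return [f | 2 if m else f for f, m in zip(flags.tolist(), mask.tolist())]


# both variants must agree before timing them
assert with_where().tolist() == with_python_loop()

for fn in (with_where, with_python_loop):
    seconds = timeit.timeit(fn, number=5)
    print(f"{fn.__name__}: {seconds / 5:.6f} s per call")
```

Run it with a few different values of n to see how the gap changes with input size.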