Given a Polars DataFrame, is there a way to replace a particular value by "null"? For example, if there's a sentinel value like "_UNKNOWN" and I want to make it truly missing in the dataframe instead.
Update:
Expr.replace() has also since been added to Polars.
df.with_columns(pl.col(pl.String).replace("_UNKNOWN", None))
shape: (4, 3)
┌──────┬──────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞══════╪══════╪═════╡
│ a ┆ null ┆ 1 │
│ b ┆ d ┆ 2 │
│ null ┆ e ┆ 3 │
│ c ┆ f ┆ 4 │
└──────┴──────┴─────┘
You can use .when().then().otherwise()
pl.col(pl.String) is used to select all "string columns".
df = pl.DataFrame({
"A": ["a", "b", "_UNKNOWN", "c"],
"B": ["_UNKNOWN", "d", "e", "f"],
"C": [1, 2, 3, 4]
})
df.with_columns(
pl.when(pl.col(pl.String) == "_UNKNOWN")
.then(None)
.otherwise(pl.col(pl.String)) # keep original value
.name.keep()
)
This is really a tweak of @jqurious's answer.
When you do a when and your condition isn't met then the default is null so you can just do:
df.with_columns(
pl.when(pl.col(pl.String) != "_UNKNOWN")
.then(pl.col(pl.String)) # keep original value
.name.keep()
)
If you have multiple null conditions say null_strings=["_UNKNOWN", "UNK", "who_knows"] then you can use a is_in like this:
df.with_columns(
pl.when(~pl.col(pl.String).is_in(null_strings))
.then(pl.col(pl.String)) # keep original value
.name.keep()
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With