Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Idiomatic replacement of empty string with null in Polars

I have a polars DataFrame with a number of Series that look like:

pl.Series(['cow', 'cat', '', 'lobster', ''])
# Series: '' [str]
# [
#   "cow"
#   "cat"
#   ""
#   "lobster"
#   ""
# ]

and I'd like them to be

pl.Series(['cow', 'cat', None, 'lobster', None])
# Series: '' [str]
# [
#   "cow"
#   "cat"
#   null
#   "lobster"
#   null
# ]

A simple string replacement gives me an error:

pl.Series(['cow', 'cat', '', 'lobster', '']).str.replace('', None)

ComputeError: value cannot be 'null' in 'replace' expression

What's the idiomatic way of doing this for a Series/DataFrame in polars?


2 Answers

Series

For a single Series, you can use the set method.

s = pl.Series(["cow", "cat", "", "lobster", ""])
s.set(s.str.len_chars() == 0, None)
shape: (5,)
Series: '' [str]
[
        "cow"
        "cat"
        null
        "lobster"
        null
]

DataFrame

For DataFrames, I would suggest using when/then/otherwise. For example, with this data:

df = pl.DataFrame({
    "str1": ["cow", "dog", "", "lobster", ""],
    "str2": ["", "apple", "orange", "", "kiwi"],
    "str3": ["house", "", "apartment", "condo", ""],
})
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1    ┆ str2   ┆ str3      │
│ ---     ┆ ---    ┆ ---       │
│ str     ┆ str    ┆ str       │
╞═════════╪════════╪═══════════╡
│ cow     ┆        ┆ house     │
│ dog     ┆ apple  ┆           │
│         ┆ orange ┆ apartment │
│ lobster ┆        ┆ condo     │
│         ┆ kiwi   ┆           │
└─────────┴────────┴───────────┘

We can run a replacement on all string columns as follows:

df.with_columns(
    pl.when(pl.col(pl.String).str.len_chars() == 0)
    .then(None)
    .otherwise(pl.col(pl.String))
    .name.keep()
)
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1    ┆ str2   ┆ str3      │
│ ---     ┆ ---    ┆ ---       │
│ str     ┆ str    ┆ str       │
╞═════════╪════════╪═══════════╡
│ cow     ┆ null   ┆ house     │
│ dog     ┆ apple  ┆ null      │
│ null    ┆ orange ┆ apartment │
│ lobster ┆ null   ┆ condo     │
│ null    ┆ kiwi   ┆ null      │
└─────────┴────────┴───────────┘

The above should be fairly performant.

If you only want to replace empty strings with null on certain columns, you can provide a list:

only_these = ["str1", "str2"]
df.with_columns(
    pl.when(pl.col(only_these).str.len_chars() == 0)
    .then(None)
    .otherwise(pl.col(only_these))
    .name.keep()
)
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1    ┆ str2   ┆ str3      │
│ ---     ┆ ---    ┆ ---       │
│ str     ┆ str    ┆ str       │
╞═════════╪════════╪═══════════╡
│ cow     ┆ null   ┆ house     │
│ dog     ┆ apple  ┆           │
│ null    ┆ orange ┆ apartment │
│ lobster ┆ null   ┆ condo     │
│ null    ┆ kiwi   ┆           │
└─────────┴────────┴───────────┘

To replace empty strings in a string series.

series.replace("", None)

To replace empty strings in a specific dataframe column "A".

df.with_columns(pl.col("A").replace("", None))

To replace empty strings in all string columns of a dataframe.

df.with_columns(pl.col(pl.String).replace("", None))
like image 36
Hericks Avatar answered Dec 15 '25 08:12

Hericks