What is the best way to count the number of overlapping occurrences of a given substring in strings stored in a column of a Polars DataFrame?
The Polars API provides polars.Expr.str.count_matches, which counts all successive non-overlapping regex matches.
For example:
import polars as pl
df = pl.DataFrame({"foo": ["aaaaa", "aabaa", "aaaab"]})
df.with_columns(pl.col("foo").str.count_matches("aa"))
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ u32 │
╞═════╡
│ 2   │
│ 2   │
│ 2   │
└─────┘
How can I get the counts of overlapping occurrences instead? The expected result in this case is [4, 2, 3].
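To pin down what "overlapping" means here, this is a minimal pure-Python sketch of the expected behaviour, independent of Polars. The helper name count_overlapping is only an illustration and is not part of any library. It reproduces both the non-overlapping counts shown above and the expected overlapping counts.

```python
def count_overlapping(text: str, sub: str) -> int:
    """Count occurrences of `sub` in `text`, allowing overlaps."""
    if not sub:
        raise ValueError("sub must be non-empty")
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Advance by one character only, so the next match may overlap.
        start = text.find(sub, start + 1)
    return count


values = ["aaaaa", "aabaa", "aaaab"]
# str.count skips past each match, like count_matches does.
non_overlapping = [s.count("aa") for s in values]
# The helper restarts the search one character after each match start.
overlapping = [count_overlapping(s, "aa") for s in values]
```

A helper like this could be applied row by row with pl.col("foo").map_elements(...), but that runs Python code per element and is likely to be much slower than a native expression, so it is mostly useful as a reference for checking results.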
One way to approach this problem is to slice each string into overlapping windows of length 2, one window per start offset, and count the matches in each window. The approach I have taken here requires two passes over the data, because most_chars (the length of the longest string) is evaluated up front.
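import polars as pl
from polars import col, selectors as cs

df = pl.DataFrame({"foo": ["aaaaa", "aabaa", "aaaab"]})

# First pass: length of the longest string, which bounds the number of windows.
most_chars = df.select(col('foo').str.len_chars().max()).item()

print(
    # One column per start offset i, each holding the 2-character window at i.
    df.select(
        col('foo').str.slice(i, 2).alias(str(i)) for i in range(most_chars-1)
    )
    # Keep the original column(s) from df and sum the per-window match counts.
    .select(
        df, count=pl.sum_horizontal(cs.all().str.count_matches('aa'))
    )
)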
shape: (3, 2)
┌───────┬───────┐
│ foo   ┆ count │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ aaaaa ┆ 4     │
│ aabaa ┆ 2     │
│ aaaab ┆ 3     │
└───────┴───────┘
└───────┴───────┘