Count overlapping occurrences of substring in strings in Polars

What is the best way to count the number of overlapping occurrences of a given substring in strings stored in a column of a Polars DataFrame?

The Polars API provides polars.Expr.str.count_matches, which counts all successive non-overlapping regex matches.

For example:

import polars as pl

df = pl.DataFrame({"foo": ["aaaaa", "aabaa", "aaaab"]})

df.with_columns(pl.col("foo").str.count_matches("aa"))

shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ u32 │
╞═════╡
│ 2   │
│ 2   │
│ 2   │
└─────┘

How can I get the counts of overlapping occurrences instead? The expected result in this case is [4, 2, 3].

RastO asked Nov 01 '25 17:11


1 Answer

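One way to approach this problem is to slice each string into overlapping windows of length 2 (the length of the substring), one window per start position, and count the matches within each window. This approach requires two passes over the data, because most_chars, the length of the longest string, has to be computed first to know how many windows to create.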

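import polars as pl
from polars import col, selectors as cs

df = pl.DataFrame({"foo": ["aaaaa", "aabaa", "aaaab"]})
# First pass: length of the longest string, which bounds the number of windows.
most_chars = df.select(col('foo').str.len_chars().max()).item()

print(
    # One column per start position i, each holding the 2-character window at i.
    df.select(
        col('foo').str.slice(i, 2).alias(str(i)) for i in range(most_chars-1)
    )
    # Keep the original columns and sum the per-window match counts row-wise.
    .select(
        df, count=pl.sum_horizontal(cs.all().str.count_matches('aa'))
    )
)

shape: (3, 2)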
┌───────┬───────┐
│ foo   ┆ count │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ aaaaa ┆ 4     │
│ aabaa ┆ 2     │
│ aaaab ┆ 3     │
└───────┴───────┘
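
As a cross-check on the counts above, the same sliding-window idea can be sketched in plain Python, without Polars. This is only an illustration: count_overlapping is a hypothetical helper name, and unlike str.count_matches it compares the substring literally rather than treating it as a regex.

```python
def count_overlapping(text: str, sub: str) -> int:
    """Count occurrences of `sub` in `text`, allowing overlaps (literal match)."""
    if not sub:
        raise ValueError("sub must be non-empty")
    width = len(sub)
    # Look at the window of len(sub) characters at every start position,
    # mirroring the str.slice(i, 2) columns above, and count exact matches.
    return sum(text[i:i + width] == sub for i in range(len(text) - width + 1))


values = ["aaaaa", "aabaa", "aaaab"]
counts = [count_overlapping(v, "aa") for v in values]
print(counts)  # [4, 2, 3]
```

Because the window width is taken from len(sub), this sketch is not tied to 2-character substrings. Strings shorter than the substring produce no windows and therefore a count of 0.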
Cameron Riddell answered Nov 04 '25 10:11