
How to select the longest string from a list of strings in polars?

How do I select the longest string from a list of strings in polars?

Example and expected output:

import polars as pl

df = pl.DataFrame({
    "values": [
        ["the", "quickest", "brown", "fox"],
        ["jumps", "over", "the", "lazy", "dog"],
        []
    ]
})
┌──────────────────────────────┬────────────────┐
│ values                       ┆ longest_string │
│ ---                          ┆ ---            │
│ list[str]                    ┆ str            │
╞══════════════════════════════╪════════════════╡
│ ["the", "quickest", … "fox"] ┆ quickest       │
│ ["jumps", "over", … "dog"]   ┆ jumps          │
│ []                           ┆ null           │
└──────────────────────────────┴────────────────┘

My use case is to select the longest overlapping match.

Edit: elaborating on the longest overlapping match, this is the output for the example provided by polars:

┌────────────┬───────────┬─────────────────────────────────┐
│ values     ┆ matches   ┆ matches_overlapping             │
│ ---        ┆ ---       ┆ ---                             │
│ str        ┆ list[str] ┆ list[str]                       │
╞════════════╪═══════════╪═════════════════════════════════╡
│ discontent ┆ ["disco"] ┆ ["disco", "onte", "discontent"] │
└────────────┴───────────┴─────────────────────────────────┘

I want a way to select the longest match in matches_overlapping.

asked Sep 14 '25 12:09 by conjuncts


1 Answer

You can do something like:

df.with_columns(
    pl.col('values').list.get(
        # index of the longest string in each list
        pl.col('values')
        .list.eval(pl.element().str.len_chars())
        .list.arg_max()
    )
    .alias('longest_string')
)

This expression:

pl.col('values')
.list.eval(pl.element().str.len_chars())
.list.arg_max()

first applies str.len_chars to every string in each list via .list.eval, producing a list of lengths per row. Then .list.arg_max returns the index of the maximum element of that list, which here is the index of the longest string.

That index is then passed to list.get, which retrieves the string at that position in the original list.
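To make the three steps concrete, here is a rough plain-Python sketch of the same per-row logic, without polars. The function name longest_string is made up for illustration, and returning None for an empty list is an assumption chosen to match the null in the question's expected output. It is not how polars implements the expression internally.

```python
from typing import Optional


def longest_string(values: list[str]) -> Optional[str]:
    """Hypothetical helper mirroring the polars expression for one row."""
    # Step 1: map each string to its length
    # (the role of .list.eval(pl.element().str.len_chars())).
    lengths = [len(s) for s in values]

    # An empty list has no maximum. Returning None here is an assumption
    # that matches the null shown in the expected output.
    if not lengths:
        return None

    # Step 2: find the index of the largest length (the role of .list.arg_max()).
    # Python's max() keeps the first index when several lengths are equal.
    idx = max(range(len(lengths)), key=lengths.__getitem__)

    # Step 3: fetch the element at that index (the role of .list.get()).
    return values[idx]


rows = [
    ["the", "quickest", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog"],
    [],
]
results = [longest_string(row) for row in rows]
print(results)  # ['quickest', 'jumps', None]
```

The same helper applied to the overlapping matches from the question, ["disco", "onte", "discontent"], picks "discontent".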

answered Sep 17 '25 03:09 by juanpa.arrivillaga