
How to drop certain elements from a list typed column in Polars?

Suppose I have a Polars DataFrame with a column of type list[str]:

┌─────────────────────────────────────────────────┐
│ words                                           │
│ ---                                             │
│ list[str]                                       │
╞═════════════════════════════════════════════════╡
│ ["i", "like", "the", "pizza"]                   │
│ ["the", "dog", "is", "runnig"]                  │
│ ["me", "and", "my", "friend", "are", "playing"] │
└─────────────────────────────────────────────────┘

I would like to remove stop words from every list.

I can apply a custom function using map_elements:

import polars as pl

pl.Config(fmt_table_cell_list_len=8, fmt_str_lengths=80)

df = pl.DataFrame({
    "words": [["i", "like", "the", "pizza"],
              ["the", "dog", "is", "runnig"],
              ["me", "and", "my", "friend", "are", "playing"]]
})

STOP_WORDS = ["the"]

filtered_df = df.with_columns(
    pl.col("words").map_elements(
        # Python-level UDF: called once per row with that row's list
        lambda words: [word for word in words if word not in STOP_WORDS],
        return_dtype=pl.List(pl.String),
    )
)
shape: (3, 1)
┌─────────────────────────────────────────────────┐
│ words                                           │
│ ---                                             │
│ list[str]                                       │
╞═════════════════════════════════════════════════╡
│ ["i", "like", "pizza"]                          │
│ ["dog", "is", "runnig"]                         │
│ ["me", "and", "my", "friend", "are", "playing"] │
└─────────────────────────────────────────────────┘

However, the docs state that custom Python UDFs are much slower, so I would prefer a solution based on the native API.

Is there a built-in function in Polars that achieves this?

Thanks.

barak1412 asked Oct 12 '25 16:10


1 Answer

.list.set_difference() may also be an option.

df.with_columns(
    pl.col("words").list.set_difference(STOP_WORDS)
)
shape: (3, 1)
┌─────────────────────────────────────────────────┐
│ words                                           │
│ ---                                             │
│ list[str]                                       │
╞═════════════════════════════════════════════════╡
│ ["i", "like", "pizza"]                          │
│ ["runnig", "dog", "is"]                         │
│ ["me", "and", "my", "friend", "are", "playing"] │
└─────────────────────────────────────────────────┘

Do note that the "set" approach also removes duplicates and does not guarantee the original element order (see the second row above), which may or may not be desired.

pl.Series([["a", "a", "a", "b", "c"]]).list.set_difference(["c", "d"])
shape: (1,)
Series: '' [list[str]]
[
    ["a", "b"]
]
jqurious answered Oct 14 '25 05:10


