Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

filter polars DataFrame based on when rows whose specific columns contain pairs from a list of pairs

In this example, on columns ["foo", "ham"], I want rows 1 and 4 to be removed since they match a pair in the list

df = pl.DataFrame(
    {
        "foo": [1, 1, 2, 2, 3, 3, 4],
        "bar": [6, 7, 8, 9, 10, 11, 12],
        "ham": ["a", "b", "c", "d", "e", "f", "b"]
    }
)
pairs = [(1,"b"),(3,"e"),(4,"g")]

The following worked for me but I think this will be problematic when the dataframe and list of pairs are large.

for a, b in pairs:
    df = df.filter(~(pl.col('foo') == a) | ~(pl.col('ham') == b))

I think this is the pandas implementation for this problem Pandas: How to remove rows from a dataframe based on a list of tuples representing values in TWO columns?

I am not sure what the Polars implementation of it is.

(I think this problem can be generalized to any number of selected columns and any number of elements in a group. For instance, rather than a list of pairs, it can be another dataframe. You get the 'set difference', in terms of rows, of the two dataframes based on specific columns.)

like image 260
pikaft Avatar asked Oct 14 '25 11:10

pikaft


1 Answers

It looks like an ANTI JOIN

schema = ["foo", "ham"]

(df.with_row_index() # just to show what "row numbers" were "removed"
   .join(
       pl.DataFrame(pairs, orient="row", schema=schema), 
       on = schema, 
       how = "anti"
   )
)
shape: (5, 4)
┌───────┬─────┬─────┬─────┐
│ index ┆ foo ┆ bar ┆ ham │
│ ---   ┆ --- ┆ --- ┆ --- │
│ u32   ┆ i64 ┆ i64 ┆ str │
╞═══════╪═════╪═════╪═════╡
│ 0     ┆ 1   ┆ 6   ┆ a   │
│ 2     ┆ 2   ┆ 8   ┆ c   │
│ 3     ┆ 2   ┆ 9   ┆ d   │
│ 5     ┆ 3   ┆ 11  ┆ f   │
│ 6     ┆ 4   ┆ 12  ┆ b   │
└───────┴─────┴─────┴─────┘
like image 60
jqurious Avatar answered Oct 17 '25 03:10

jqurious