I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).
Using the following frame as an example:
import polars as pl
df = pl.DataFrame(
data = [
["x",123, 4.5, "misc"],
["y",456,10.0,"other"],
["z",789,99.5,"value"],
],
schema = ["a","b","c","d"],
orient = "row",
)
A PostgreSQL statement could use a VALUES expression, like so...
(("a","b") IN (VALUES ('x',123),('y',456)))
...and a pandas equivalent might set a multi-column index.
pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]
The polars syntax would look like this:
df.join(
pl.DataFrame(
data = [('x',123),('y',456)],
schema = {col:tp for col,tp in df.schema.items() if col in ("a","b")},
orient = "row",
),
on = ["a","b"],
how = "inner", # or 'anti' for "not in"
)
Is a multi-column is_in construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):
df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )
In the next polars release >0.13.44 this will work on the struct datatype.
We convert the 2 (or more) columns we want to check to a struct with pl.struct and call the is_in expression. (A conversion to struct is a free operation)
df = pl.DataFrame(
data=[
["x", 123, 4.5, "misc"],
["y", 456, 10.0, "other"],
["z", 789, 99.5, "value"],
],
schema=["a", "b", "c", "d"],
orient="row",
)
df.filter(
pl.struct("a", "b").is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
)
shape: (2, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str │
╞═════╪═════╪══════╪═══════╡
│ x ┆ 123 ┆ 4.5 ┆ misc │
│ y ┆ 456 ┆ 10.0 ┆ other │
└─────┴─────┴──────┴───────┘
The idiomatic way to filter data by presence in another DataFrame are semi and anti joins. Inner joins also filter by presence, but they include the columns of the right hand DataFrame, where a semi join does not and only filters the left hand side.
The reason why these joins are preferred over is_in is that they are much faster and currently allow for more optimization.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With