I have two Dataframes - graph
and search
with the same schema
Schema for graph:
SCHEMA = {
START_RANGE: pl.Int64,
END_RANGE: pl.Int64,
}
Schema for search:
SCHEMA = {
START: pl.Int64,
END: pl.Int64,
}
I want to search the graph dataframe.
For every row in search dataframe, I want to find all the rows in the graph dataframe where the START_RANGE
and END_RANGE
values are strictly within the range of the search
dataframe.
How to achieve this using polars only?
Example -
graph = pl.DataFrame(
{
"START_RANGE": [1, 10, 20, 30],
"END_RANGE": [5, 15, 25, 35],
},
)
search = pl.DataFrame(
{
"START": [2, 11, 7],
"END": [5, 14, 12],
},
)
# Expected output
[2,5] is in range [1,5] -> Where 2>=1 and 5<=5
[11,14] is in range [10,15] -> Where 11>=10 and 14<=15
[7,12] is not in any range
output = pl.DataFrame(
{
"START_RANGE": [1, 10],
"END_RANGE": [5, 15],
},
)
Polars doesn't have full range joins but it does have join_asof
which you can combine with a filter to get better performance than a full cross join filter. It's a left join so it'll have a return for every row in the left df.
It would look like
(
search
.sort('START')
.join_asof(
graph.sort('START_RANGE'),
right_on='START_RANGE', left_on='START'
)
.filter(pl.col('END')<=pl.col('END_RANGE'))
.select('START_RANGE','END_RANGE')
)
shape: (2, 2)
┌─────────────┬───────────┐
│ START_RANGE ┆ END_RANGE │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════════════╪═══════════╡
│ 1 ┆ 5 │
│ 10 ┆ 15 │
└─────────────┴───────────┘
The join_asof
ensures that START
>=START_RANGE
and then the post filter ensures that END
<=END_RANGE
You can also approach it from the opposite angle like
(
graph.sort('START_RANGE')
.join_asof(
search.sort('START'),
left_on ='START_RANGE', right_on='START',
strategy='forward'
)
.filter(pl.col('START')>=pl.col('START_RANGE'))
.select('START_RANGE','END_RANGE')
)
Depending on your data, one of those will be better than the other. For instance if you have
search = pl.DataFrame(
{
"START": [2, 3, 11, 7],
"END": [5, 4, 14, 12],
},
)
then the second approach returns the same thing as the original search
but the first approach will return an extra row of (1, 5) because of the (3,4) row.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With