Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Polars - Filter DataFrame using another DataFrame's row's

I have two Dataframes - graph and search with the same schema

Schema for graph:

SCHEMA = {
    START_RANGE: pl.Int64,
    END_RANGE: pl.Int64,
}
Schema for search:

SCHEMA = {
    START: pl.Int64,
    END: pl.Int64,
}

I want to search the graph dataframe. For every row in search dataframe, I want to find all the rows in the graph dataframe where the START_RANGE and END_RANGE values are strictly within the range of the search dataframe.

How to achieve this using polars only?

Example -

graph = pl.DataFrame(
    {
        "START_RANGE": [1, 10, 20, 30],
        "END_RANGE": [5, 15, 25, 35],
    },
)

search = pl.DataFrame(
    {
        "START": [2, 11, 7],
        "END": [5, 14, 12], 
    },
)
# Expected output

[2,5] is in range [1,5]      ->    Where 2>=1   and 5<=5
[11,14] is in range [10,15]  ->    Where 11>=10 and 14<=15
[7,12] is not in any range 

output = pl.DataFrame(
    {
        "START_RANGE": [1, 10],
        "END_RANGE": [5, 15],
    },
)
like image 203
Ashmeet Lamba Avatar asked Oct 18 '25 14:10

Ashmeet Lamba


1 Answers

Polars doesn't have full range joins but it does have join_asof which you can combine with a filter to get better performance than a full cross join filter. It's a left join so it'll have a return for every row in the left df.

It would look like

(
    search
    .sort('START')
    .join_asof(
        graph.sort('START_RANGE'), 
        right_on='START_RANGE', left_on='START'
        )
    .filter(pl.col('END')<=pl.col('END_RANGE'))
    .select('START_RANGE','END_RANGE')
)
shape: (2, 2)
┌─────────────┬───────────┐
│ START_RANGE ┆ END_RANGE │
│ ---         ┆ ---       │
│ i64         ┆ i64       │
╞═════════════╪═══════════╡
│ 1           ┆ 5         │
│ 10          ┆ 15        │
└─────────────┴───────────┘

The join_asof ensures that START>=START_RANGE and then the post filter ensures that END<=END_RANGE

You can also approach it from the opposite angle like

(
    
    graph.sort('START_RANGE')
    .join_asof(
        search.sort('START'), 
       left_on ='START_RANGE', right_on='START',
       strategy='forward'
        )
    .filter(pl.col('START')>=pl.col('START_RANGE'))
    .select('START_RANGE','END_RANGE')
)

Depending on your data, one of those will be better than the other. For instance if you have

search = pl.DataFrame(
    {
        "START": [2, 3, 11, 7],
        "END": [5, 4, 14, 12], 
    },
)

then the second approach returns the same thing as the original search but the first approach will return an extra row of (1, 5) because of the (3,4) row.

like image 95
Dean MacGregor Avatar answered Oct 21 '25 04:10

Dean MacGregor