Polars - Filter DataFrame using another DataFrame's row's

Question

I have two Dataframes - graph and search with the same schema

Schema for graph:

SCHEMA = {
    START_RANGE: pl.Int64,
    END_RANGE: pl.Int64,
}

Schema for search:

SCHEMA = {
    START: pl.Int64,
    END: pl.Int64,
}

I want to search the graph dataframe. For every row in search dataframe, I want to find all the rows in the graph dataframe where the START_RANGE and END_RANGE values are strictly within the range of the search dataframe.

How to achieve this using polars only?

Example -

graph = pl.DataFrame(
    {
        "START_RANGE": [1, 10, 20, 30],
        "END_RANGE": [5, 15, 25, 35],
    },
)

search = pl.DataFrame(
    {
        "START": [2, 11, 7],
        "END": [5, 14, 12], 
    },
)

# Expected output

[2,5] is in range [1,5]      ->    Where 2>=1   and 5<=5
[11,14] is in range [10,15]  ->    Where 11>=10 and 14<=15
[7,12] is not in any range 

output = pl.DataFrame(
    {
        "START_RANGE": [1, 10],
        "END_RANGE": [5, 15],
    },
)

Dean MacGregor · Accepted Answer

Polars doesn't have full range joins but it does have join_asof which you can combine with a filter to get better performance than a full cross join filter. It's a left join so it'll have a return for every row in the left df.

It would look like

(
    search
    .sort('START')
    .join_asof(
        graph.sort('START_RANGE'), 
        right_on='START_RANGE', left_on='START'
        )
    .filter(pl.col('END')<=pl.col('END_RANGE'))
    .select('START_RANGE','END_RANGE')
)
shape: (2, 2)
┌─────────────┬───────────┐
│ START_RANGE ┆ END_RANGE │
│ ---         ┆ ---       │
│ i64         ┆ i64       │
╞═════════════╪═══════════╡
│ 1           ┆ 5         │
│ 10          ┆ 15        │
└─────────────┴───────────┘

The join_asof ensures that START>=START_RANGE and then the post filter ensures that END<=END_RANGE

You can also approach it from the opposite angle like

(
    
    graph.sort('START_RANGE')
    .join_asof(
        search.sort('START'), 
       left_on ='START_RANGE', right_on='START',
       strategy='forward'
        )
    .filter(pl.col('START')>=pl.col('START_RANGE'))
    .select('START_RANGE','END_RANGE')
)

Depending on your data, one of those will be better than the other. For instance if you have

search = pl.DataFrame(
    {
        "START": [2, 3, 11, 7],
        "END": [5, 4, 14, 12], 
    },
)

then the second approach returns the same thing as the original search but the first approach will return an extra row of (1, 5) because of the (3,4) row.

Polars - Filter DataFrame using another DataFrame's row's

Tags:

python

dataframe

python-polars

Ashmeet Lamba

1 Answers

Dean MacGregor

Recent Activity

Donate For Us

Polars - Filter DataFrame using another DataFrame's row's

Tags:

python

dataframe

python-polars

Ashmeet Lamba

1 Answers

Dean MacGregor

Related questions

Recent Activity

Donate For Us