What is the best way to merge two dataframes that one of them has overlapping ranges?

Question

My DataFrames are:

import pandas as pd

df_1 = pd.DataFrame(
    {
        'a': [10, 12, 14, 20, 25, 30, 42, 50, 80]
    }
)

df_2 = pd.DataFrame(
    {
        'start': [9, 19],
        'end': [26, 50],
        'label': ['a', 'b']
    }
)

Expected output: Adding column label to df_1:

df_2 defines the ranges of labels. So for example, the first row of df_2 start of the range is 9 and the end is 22. Now I want to slice df_1 based on start and end and give this label to the slice. Note that start is exlusive and end is inclusive. And my labels ranges are overlapping.

These are my attempts. The first one works but I am not sure if it is the best.

# attempt_1
dfc = pd.DataFrame([])
for idx, row in df_2.iterrows():
    start = row['start']
    end = row['end']
    label = row['label']
    df_slice = df_1.loc[df_1.a.between(start, end, inclusive='right')]
    df_slice['label'] = label
    dfc = pd.concat([df_slice, dfc], ignore_index=True)

## attempt 2
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
label = df_2.iloc[idx.get_indexer(df_1.a), 'label']
df_1['label'] = label.to_numpy()

Vitalizzare · Accepted Answer

I'd try to concatenate generated labeled ranges using pandas.concat this way:

template = df_1.set_index('a')
ranges = df_2.values

output = pd.concat(
    template.loc[start:end].assign(label=label) 
    for start, end, label in ranges
).reset_index()

It's close to your solution with two major differences:

All iterations are collapsed into an inner generator.
We use df_1['a'] as an index, which is implied by its nature.

Scott Boston · Answer

A native pandas way

You can use list comprehension with range and explode to create a dataframe to join df_1 on for the each element in df_2's start and end ranges.

df_2_join = df_2.assign(
                 a=[range(s, e + 1) for s, e in zip(df_2["start"], df_2["end"])]
            ).explode("a")

df_1.merge(df_2_join).sort_values('start')

Output:

    a  start  end label
0  10      9   26     a
1  12      9   26     a
2  14      9   26     a
3  20      9   26     a
5  25      9   26     a
4  20     19   50     b
6  25     19   50     b
7  30     19   50     b
8  42     19   50     b
9  50     19   50     b

What is the best way to merge two dataframes that one of them has overlapping ranges?

Tags:

python

pandas

dataframe

Amir

2 Answers

Vitalizzare

A native pandas way

Scott Boston

Recent Activity

Donate For Us

What is the best way to merge two dataframes that one of them has overlapping ranges?

Tags:

python

pandas

dataframe

Amir

2 Answers

Vitalizzare

A native pandas way

Scott Boston

Related questions

Recent Activity

Donate For Us