Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the best way to merge two dataframes that one of them has overlapping ranges?

My DataFrames are:

import pandas as pd

df_1 = pd.DataFrame(
    {
        'a': [10, 12, 14, 20, 25, 30, 42, 50, 80]
    }
)

df_2 = pd.DataFrame(
    {
        'start': [9, 19],
        'end': [26, 50],
        'label': ['a', 'b']
    }
)

Expected output: Adding column label to df_1:

a    label
10    a
12    a
14    a
20    a
25    a
20    b
25    b
30    b
42    b
50    b

df_2 defines the ranges of labels. So for example, the first row of df_2 start of the range is 9 and the end is 22. Now I want to slice df_1 based on start and end and give this label to the slice. Note that start is exlusive and end is inclusive. And my labels ranges are overlapping.

These are my attempts. The first one works but I am not sure if it is the best.

# attempt_1
dfc = pd.DataFrame([])
for idx, row in df_2.iterrows():
    start = row['start']
    end = row['end']
    label = row['label']
    df_slice = df_1.loc[df_1.a.between(start, end, inclusive='right')]
    df_slice['label'] = label
    dfc = pd.concat([df_slice, dfc], ignore_index=True)

## attempt 2
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
label = df_2.iloc[idx.get_indexer(df_1.a), 'label']
df_1['label'] = label.to_numpy()
like image 624
Amir Avatar asked Oct 13 '25 01:10

Amir


2 Answers

I'd try to concatenate generated labeled ranges using pandas.concat this way:

template = df_1.set_index('a')
ranges = df_2.values

output = pd.concat(
    template.loc[start:end].assign(label=label) 
    for start, end, label in ranges
).reset_index()

It's close to your solution with two major differences:

  1. All iterations are collapsed into an inner generator.
  2. We use df_1['a'] as an index, which is implied by its nature.
like image 57
Vitalizzare Avatar answered Oct 14 '25 17:10

Vitalizzare


A native pandas way

You can use list comprehension with range and explode to create a dataframe to join df_1 on for the each element in df_2's start and end ranges.

df_2_join = df_2.assign(
                 a=[range(s, e + 1) for s, e in zip(df_2["start"], df_2["end"])]
            ).explode("a")

df_1.merge(df_2_join).sort_values('start')

Output:

    a  start  end label
0  10      9   26     a
1  12      9   26     a
2  14      9   26     a
3  20      9   26     a
5  25      9   26     a
4  20     19   50     b
6  25     19   50     b
7  30     19   50     b
8  42     19   50     b
9  50     19   50     b
like image 29
Scott Boston Avatar answered Oct 14 '25 16:10

Scott Boston