My DataFrames are:
import pandas as pd
df_1 = pd.DataFrame(
{
'a': [10, 12, 14, 20, 25, 30, 42, 50, 80]
}
)
df_2 = pd.DataFrame(
{
'start': [9, 19],
'end': [26, 50],
'label': ['a', 'b']
}
)
Expected output: Adding column label
to df_1
:
a label
10 a
12 a
14 a
20 a
25 a
20 b
25 b
30 b
42 b
50 b
df_2
defines the ranges of labels. So for example, the first row of df_2
start of the range is 9 and the end is 22. Now I want to slice df_1
based on start and end and give this label to the slice. Note that start
is exlusive and end
is inclusive. And my labels ranges are overlapping.
These are my attempts. The first one works but I am not sure if it is the best.
# attempt_1
dfc = pd.DataFrame([])
for idx, row in df_2.iterrows():
start = row['start']
end = row['end']
label = row['label']
df_slice = df_1.loc[df_1.a.between(start, end, inclusive='right')]
df_slice['label'] = label
dfc = pd.concat([df_slice, dfc], ignore_index=True)
## attempt 2
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
label = df_2.iloc[idx.get_indexer(df_1.a), 'label']
df_1['label'] = label.to_numpy()
I'd try to concatenate generated labeled ranges using pandas.concat
this way:
template = df_1.set_index('a')
ranges = df_2.values
output = pd.concat(
template.loc[start:end].assign(label=label)
for start, end, label in ranges
).reset_index()
It's close to your solution with two major differences:
df_1['a']
as an index, which is implied by its nature.You can use list comprehension with range
and explode
to create a dataframe to join df_1 on for the each element in df_2's start and end ranges.
df_2_join = df_2.assign(
a=[range(s, e + 1) for s, e in zip(df_2["start"], df_2["end"])]
).explode("a")
df_1.merge(df_2_join).sort_values('start')
Output:
a start end label
0 10 9 26 a
1 12 9 26 a
2 14 9 26 a
3 20 9 26 a
5 25 9 26 a
4 20 19 50 b
6 25 19 50 b
7 30 19 50 b
8 42 19 50 b
9 50 19 50 b
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With