I have two dataframes
df_a=
Start Stop Value
0 0 100 0.0
1 101 200 1.0
2 201 1000 0.0
df_b=
Start Stop Value
0 0 50 0.0
1 51 300 1.0
2 301 1000 0.0
I would like to generate a DataFrame
which contains the intervals as identified by Start
and Stop
, where Value was the same in df_a
and df_b. For each interval I would like to store: if Value
was the same, and which was the value in df_a
and df_b
.
Desired output:
df_out=
Start Stop SameValue Value_dfA Value_dfB
0 50 1 0 0
51 100 0 0 1
101 200 1 1 1
201 300 0 0 1
[...]
Not sure if this is the best way to do this but you can reindex
, join
, groupby
and agg
to get your intervals, e.g.:
Expand each df
so that the index is every single value of the range (Start
to Stop
) using reindex()
and pad
ding the values:
In []:
df_a_expanded = df_a.set_index('Start').reindex(range(max(df_a['Stop'])+1)).fillna(method='pad')
df_a_expanded
Out[]:
Stop Value
Start
0 100.0 0.0
1 100.0 0.0
2 100.0 0.0
3 100.0 0.0
4 100.0 0.0
...
997 1000.0 0.0
998 1000.0 0.0
999 1000.0 0.0
1000 1000.0 0.0
[1001 rows x 2 columns]
In []:
df_b_expanded = df_b.set_index('Start').reindex(range(max(df_b['Stop'])+1)).fillna(method='pad')
Join the two expanded dfs
:
In []:
df = df_a_expanded.join(df_b_expanded, lsuffix='_dfA', rsuffix='_dfB').reset_index()
df
Out[]:
Start Stop_dfA Value_dfA Stop_dfB Value_dfB
0 0 100.0 0.0 50.0 0.0
1 1 100.0 0.0 50.0 0.0
2 2 100.0 0.0 50.0 0.0
3 3 100.0 0.0 50.0 0.0
4 4 100.0 0.0 50.0 0.0
...
Note: you can ignore the Stop
columns and could have dropped them in the previous step.
There is no standard way to groupby
only consecutive values (à la itertools.groupby
), so resorting to a cumsum()
hack:
In []:
groups = (df[['Value_dfA', 'Value_dfB']] != df[['Value_dfA', 'Value_dfB']].shift()).any(axis=1).cumsum()
g = df.groupby([groups, 'Value_dfA', 'Value_dfB'], as_index=False)
Now you can get the result you want by aggregating the group with min
, max
:
In []:
df_out = g['Start'].agg({'Start': 'min', 'Stop': 'max'})
df_out
Out[]:
Value_dfA Value_dfB Start Stop
0 0.0 0.0 0 50
1 0.0 1.0 51 100
2 1.0 1.0 101 200
3 0.0 1.0 201 300
4 0.0 0.0 301 1000
Now you just have to add the SameValue
column and, if desired, order the columns to get the exact output you want:
In []:
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB'])*1
df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Out[]:
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
This assumes the ranges of the two dataframes are the same, or you will need to handle the NaN
s you will get with the join()
.
I found a way but not sure it is the most efficient. You have the input data:
import pandas as pd
dfa = pd.DataFrame({'Start': [0, 101, 201], 'Stop': [100, 200, 1000], 'Value': [0., 1., 0.]})
dfb = pd.DataFrame({'Start': [0, 51, 301], 'Stop': [50, 300, 1000], 'Value': [0., 1., 0.]})
First I would create the columns Start
and Stop
of df_out
with:
df_out = pd.DataFrame({'Start': sorted(set(dfa['Start'])|set(dfb['Start'])),
'Stop': sorted(set(dfa['Stop'])|set(dfb['Stop']))})
Then to get the value of dfa
(and dfb
) associated to the right range(Start,Stop) in a column named Value_dfA
(and Value_dfB
), I would do:
df_out['Value_dfA'] = df_out['Start'].apply(lambda x: dfa['Value'][dfa['Start'] <= x].iloc[-1])
df_out['Value_dfB'] = df_out['Start'].apply(lambda x: dfb['Value'][dfb['Start'] <= x].iloc[-1])
To get the column SameValue
, do:
df_out['SameValue'] = df_out.apply(lambda x: 1 if x['Value_dfA'] == x['Value_dfB'] else 0,axis=1)
If it matters, you can reorder the columns with:
df_out = df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Your output is then
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With