I have two dataframes
df_a=
Start Stop Value
0 0 100 0.0
1 101 200 1.0
2 201 1000 0.0
df_b=
Start Stop Value
0 0 50 0.0
1 51 300 1.0
2 301 1000 0.0
I would like to generate a DataFrame which contains the intervals as identified by Start and Stop, where Value was the same in df_a and df_b. For each interval I would like to store: if Value was the same, and which was the value in df_a and df_b.
Desired output:
df_out=
Start Stop SameValue Value_dfA Value_dfB
0 50 1 0 0
51 100 0 0 1
101 200 1 1 1
201 300 0 0 1
[...]
Not sure if this is the best way to do this but you can reindex, join, groupby and agg to get your intervals, e.g.:
Expand each df so that the index is every single value of the range (Start to Stop) using reindex() and padding the values:
In []:
df_a_expanded = df_a.set_index('Start').reindex(range(max(df_a['Stop'])+1)).fillna(method='pad')
df_a_expanded
Out[]:
Stop Value
Start
0 100.0 0.0
1 100.0 0.0
2 100.0 0.0
3 100.0 0.0
4 100.0 0.0
...
997 1000.0 0.0
998 1000.0 0.0
999 1000.0 0.0
1000 1000.0 0.0
[1001 rows x 2 columns]
In []:
df_b_expanded = df_b.set_index('Start').reindex(range(max(df_b['Stop'])+1)).fillna(method='pad')
Join the two expanded dfs:
In []:
df = df_a_expanded.join(df_b_expanded, lsuffix='_dfA', rsuffix='_dfB').reset_index()
df
Out[]:
Start Stop_dfA Value_dfA Stop_dfB Value_dfB
0 0 100.0 0.0 50.0 0.0
1 1 100.0 0.0 50.0 0.0
2 2 100.0 0.0 50.0 0.0
3 3 100.0 0.0 50.0 0.0
4 4 100.0 0.0 50.0 0.0
...
Note: you can ignore the Stop columns and could have dropped them in the previous step.
There is no standard way to groupby only consecutive values (à la itertools.groupby), so resorting to a cumsum() hack:
In []:
groups = (df[['Value_dfA', 'Value_dfB']] != df[['Value_dfA', 'Value_dfB']].shift()).any(axis=1).cumsum()
g = df.groupby([groups, 'Value_dfA', 'Value_dfB'], as_index=False)
Now you can get the result you want by aggregating the group with min, max:
In []:
df_out = g['Start'].agg({'Start': 'min', 'Stop': 'max'})
df_out
Out[]:
Value_dfA Value_dfB Start Stop
0 0.0 0.0 0 50
1 0.0 1.0 51 100
2 1.0 1.0 101 200
3 0.0 1.0 201 300
4 0.0 0.0 301 1000
Now you just have to add the SameValue column and, if desired, order the columns to get the exact output you want:
In []:
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB'])*1
df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Out[]:
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
This assumes the ranges of the two dataframes are the same, or you will need to handle the NaNs you will get with the join().
I found a way but not sure it is the most efficient. You have the input data:
import pandas as pd
dfa = pd.DataFrame({'Start': [0, 101, 201], 'Stop': [100, 200, 1000], 'Value': [0., 1., 0.]})
dfb = pd.DataFrame({'Start': [0, 51, 301], 'Stop': [50, 300, 1000], 'Value': [0., 1., 0.]})
First I would create the columns Start and Stop of df_out with:
df_out = pd.DataFrame({'Start': sorted(set(dfa['Start'])|set(dfb['Start'])),
'Stop': sorted(set(dfa['Stop'])|set(dfb['Stop']))})
Then to get the value of dfa (and dfb) associated to the right range(Start,Stop) in a column named Value_dfA (and Value_dfB), I would do:
df_out['Value_dfA'] = df_out['Start'].apply(lambda x: dfa['Value'][dfa['Start'] <= x].iloc[-1])
df_out['Value_dfB'] = df_out['Start'].apply(lambda x: dfb['Value'][dfb['Start'] <= x].iloc[-1])
To get the column SameValue, do:
df_out['SameValue'] = df_out.apply(lambda x: 1 if x['Value_dfA'] == x['Value_dfB'] else 0,axis=1)
If it matters, you can reorder the columns with:
df_out = df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Your output is then
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With