Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

finding intersection of intervals in pandas

I have two dataframes

df_a=

     Start Stop Value
    0  0     100  0.0
    1  101   200  1.0
    2  201  1000  0.0

df_b=
       Start Stop Value
    0  0     50 0.0
    1  51   300 1.0
    2  301 1000  0.0

I would like to generate a DataFrame which contains the intervals as identified by Start and Stop, where Value was the same in df_a and df_b. For each interval I would like to store: if Value was the same, and which was the value in df_a and df_b. Desired output:

df_out=
  Start Stop SameValue Value_dfA Value_dfB
      0    50    1          0       0
      51   100   0          0       1
      101  200   1          1       1
      201  300   0          0       1
    [...]
like image 376
00__00__00 Avatar asked Apr 24 '18 18:04

00__00__00


2 Answers

Not sure if this is the best way to do this but you can reindex, join, groupby and agg to get your intervals, e.g.:

Expand each df so that the index is every single value of the range (Start to Stop) using reindex() and padding the values:

In []:
df_a_expanded = df_a.set_index('Start').reindex(range(max(df_a['Stop'])+1)).fillna(method='pad')
df_a_expanded

Out[]:
         Stop  Value
Start               
0       100.0    0.0
1       100.0    0.0
2       100.0    0.0
3       100.0    0.0
4       100.0    0.0
...
997    1000.0    0.0
998    1000.0    0.0
999    1000.0    0.0
1000   1000.0    0.0

[1001 rows x 2 columns]

In []:
df_b_expanded = df_b.set_index('Start').reindex(range(max(df_b['Stop'])+1)).fillna(method='pad')

Join the two expanded dfs:

In []:
df = df_a_expanded.join(df_b_expanded, lsuffix='_dfA', rsuffix='_dfB').reset_index()
df

Out[]:
      Start  Stop_dfA  Value_dfA  Stop_dfB  Value_dfB
0         0     100.0        0.0      50.0        0.0
1         1     100.0        0.0      50.0        0.0
2         2     100.0        0.0      50.0        0.0
3         3     100.0        0.0      50.0        0.0
4         4     100.0        0.0      50.0        0.0
...

Note: you can ignore the Stop columns and could have dropped them in the previous step.

There is no standard way to groupby only consecutive values (à la itertools.groupby), so resorting to a cumsum() hack:

In []:
groups = (df[['Value_dfA', 'Value_dfB']] != df[['Value_dfA', 'Value_dfB']].shift()).any(axis=1).cumsum()
g = df.groupby([groups, 'Value_dfA', 'Value_dfB'], as_index=False)

Now you can get the result you want by aggregating the group with min, max:

In []:
df_out = g['Start'].agg({'Start': 'min', 'Stop': 'max'})
df_out

Out[]:
   Value_dfA  Value_dfB  Start  Stop
0        0.0        0.0      0    50
1        0.0        1.0     51   100
2        1.0        1.0    101   200
3        0.0        1.0    201   300
4        0.0        0.0    301  1000

Now you just have to add the SameValue column and, if desired, order the columns to get the exact output you want:

In []:
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB'])*1
df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]

Out[]:
   Start  Stop  SameValue  Value_dfA  Value_dfB
0      0    50          1        0.0        0.0
1     51   100          0        0.0        1.0
2    101   200          1        1.0        1.0
3    201   300          0        0.0        1.0
4    301  1000          1        0.0        0.0

This assumes the ranges of the two dataframes are the same, or you will need to handle the NaNs you will get with the join().

like image 68
AChampion Avatar answered Sep 21 '22 21:09

AChampion


I found a way but not sure it is the most efficient. You have the input data:

import pandas as pd
dfa = pd.DataFrame({'Start': [0, 101, 201], 'Stop': [100, 200, 1000], 'Value': [0., 1., 0.]})
dfb = pd.DataFrame({'Start': [0, 51, 301], 'Stop': [50, 300, 1000], 'Value': [0., 1., 0.]})

First I would create the columns Start and Stop of df_out with:

df_out = pd.DataFrame({'Start': sorted(set(dfa['Start'])|set(dfb['Start'])), 
                       'Stop': sorted(set(dfa['Stop'])|set(dfb['Stop']))})

Then to get the value of dfa (and dfb) associated to the right range(Start,Stop) in a column named Value_dfA (and Value_dfB), I would do:

df_out['Value_dfA'] = df_out['Start'].apply(lambda x: dfa['Value'][dfa['Start'] <= x].iloc[-1])
df_out['Value_dfB'] = df_out['Start'].apply(lambda x: dfb['Value'][dfb['Start'] <= x].iloc[-1])

To get the column SameValue, do:

df_out['SameValue'] = df_out.apply(lambda x: 1 if x['Value_dfA'] == x['Value_dfB'] else 0,axis=1)

If it matters, you can reorder the columns with:

df_out = df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]

Your output is then

   Start  Stop  SameValue  Value_dfA  Value_dfB
0      0    50          1        0.0        0.0
1     51   100          0        0.0        1.0
2    101   200          1        1.0        1.0
3    201   300          0        0.0        1.0
4    301  1000          1        0.0        0.0
like image 45
Ben.T Avatar answered Sep 21 '22 21:09

Ben.T