Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group rows by overlapping ranges

I have a dataframe, where the left column is the left - most location of an object, and the right column is the right most location. I need to group the objects if they overlap, or they overlap objects that overlap (recursively). So, for example, if this is my dataframe:

     left  right
0      0    4
1      5    8
2      10   13
3      3    7
4      12   19      
5      18   23
6      31   35

so lines 0 and 3 overlap - thus they should be on the same group, and also line 1 is overlapping line 3 - thus it joins the group.

So, for this example the output should be something like that:

     left  right    group
0      0    4         0
1      5    8         0
2      10   13        1
3      3    7         0
4      12   19        1
5      18   23        1
6      31   35        2

I thought of various directions, but didn't figure it out (without an ugly for). Any help will be appreciated!

like image 299
Binyamin Even Avatar asked Jan 13 '18 19:01

Binyamin Even


2 Answers

I found the accepted solution (update: now deleted) to be misleading because it fails to generalize to similar cases. e.g. for the following example:

df = pd.DataFrame({'left': [0,5,10,3,12,13,18,31], 
    'right':[4,8,13,7,19,16,23,35]})
df

The suggested aggregate function outputs the following dataframe (note that the 18-23 should be in group 1, along with 12-19).

enter image description here

One solution is using the following approach (based on a method for combining intervals posted by @CentAu):

# Union intervals by @CentAu
from sympy import Interval, Union
def union(data):
    """ Union of a list of intervals e.g. [(1,2),(3,4)] """
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) \
        else list(u.args)

# Create a list of intervals
df['left_right'] = df[['left', 'right']].apply(list, axis=1)
intervals = union(df.left_right)

# Add a group column
df['group'] = df['left'].apply(lambda x: [g for g,l in enumerate(intervals) if 
l.contains(x)][0])

...which outputs:

enter image description here

like image 132
tomp Avatar answered Sep 23 '22 19:09

tomp


Can you try this, use rolling max and rolling min, to find the intersection of the range :

df=df.sort_values(['left','right'])
df['Group']=((df.right.rolling(window=2,min_periods=1).min()-df.left.rolling(window=2,min_periods=1).max())<0).cumsum()


df.sort_index()
Out[331]: 
   left  right  Group
0     0      4      0
1     5      8      0
2    10     13      1
3     3      7      0
4    12     19      1
5    18     23      1
6    31     35      2

For example , (1,3) and (2,4) To find the intersection

mix(3,4)-max(1,2)=1 ; 1 is more than 0; then two intervals have intersection

like image 34
BENY Avatar answered Sep 26 '22 19:09

BENY