Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Keep only largest interval in coordinates row in pandas

I have a dataframe such as:

   Groups Name start end  sum
1      G1    A   451 954 1405
2      G1    B   451 951 1402
3      G1    C   451 969 1420
4      G1    D   463 870 1333
5      G1    E   463 888 1351
6      G1    X   230 450  680
7      G1    Z   229 450  681
8      G2    F   119 841  960
9      G2    G   118 842  960
10     G3    H   460 790 1250
11     G3    I   123 300 177
12     G4    J   343 878 1221
13     G4    K   343 878 1221
14     G4    L   320 862 1182

I would like for each Groups to keep only one interval representant (an interval meaning that df.start and df.end overlaps between rows), I explain:

For exemple, in the G1 there are 2 intervals groups :

Interval 1 (with min = 451 and max = 969):

Name start end sum
A    451   954 1405
B    451   951 1402
C    451   969 1420
D    463   870 1333
E    463   888 1351

Then I take the biggest df.sum (here 1420)

and

Interval2 (with min = 229 and max = 450)

Name start end  sum
X    230   450  680
Z    229   450  681

Then I take the biggest df.sum (here 681)

If I do that for the whole dataframe I get:

   Groups Name start end  sum
3      G1    C   451 969 1420
7      G1    Z   229 450  681
9      G2    G   118 842  960
10     G3    H   460 790 1250
11     G3    I   123 300 177
12     G4    J   343 878 1221

Does someone have an idea?
Here are the data in dictionary format :

{'Groups Name start end  sum': {0: 'G1    A   451 954 1405', 1: 'G1    B   451 951 1402', 2: 'G1    C   451 969 1420', 3: 'G1    D   463 870 1333', 4: 'G1    E   463 888 1351', 5: 'G1    X   230 450  680', 6: 'G1    Z   229 450  681', 7: 'G2    F   119 841  960', 8: 'G2    G   118 842  960', 9: 'G3    H   460 790 1250', 10: 'G3    I   123 300 177', 11: 'G4    J   343 878 1221', 12: 'G4    K   343 878 1221', 13: 'G4    L   320 862 1182'}}
like image 281
chippycentra Avatar asked Nov 07 '22 02:11

chippycentra


1 Answers

You can group the dataframe by both the Groups column and a new column representing the overlapping ranges. What you can do is first sort the dataframe by Groups and end (as well as start if there are duplicates in end).

df = df.sort_values(['Groups', 'end', 'start'])

Now, since we know the order of the rows, we can create the additional column mentioned above, let's call it overlap:

c1 = df['Groups'].shift() != df['Groups']
c2 = df['end'].shift() - df['start'] < 0
df['overlap'] = (c1 | c2).cumsum()

Current dataframe:

Groups Name  start  end   sum  overlap
7      G1    Z    229  450   681        1
6      G1    X    230  450   680        1
4      G1    D    463  870  1333        2
5      G1    E    463  888  1351        2
2      G1    B    451  951  1402        2
1      G1    A    451  954  1405        2
3      G1    C    451  969  1420        2
8      G2    F    119  841   960        3
9      G2    G    118  842   960        3
11     G3    I    123  300   177        4
10     G3    H    460  790  1250        5
14     G4    L    320  862  1182        6
12     G4    J    343  878  1221        6
13     G4    K    343  878  1221        6

Finally, we get the row with the maximum sum in each group using groupby.

df.sort_values(['sum'], ascending=False).groupby('overlap').first()

Result:

        Groups Name  start  end   sum
overlap                              
1           G1    Z    229  450   681
2           G1    C    451  969  1420
3           G2    F    119  841   960
4           G3    I    123  300   177
5           G3    H    460  790  1250
6           G4    J    343  878  1221

For the selection of rows where the sum is the same in a group, it is possible to use an additional column when sorting by the sum for second-level ordering. For example, if the order should be based on the original ordering of the rows a new index column can be added at the start by using reset_index().

like image 199
Shaido Avatar answered Nov 15 '22 08:11

Shaido