I have a dataframe such as:
Groups Name start end sum
1 G1 A 451 954 1405
2 G1 B 451 951 1402
3 G1 C 451 969 1420
4 G1 D 463 870 1333
5 G1 E 463 888 1351
6 G1 X 230 450 680
7 G1 Z 229 450 681
8 G2 F 119 841 960
9 G2 G 118 842 960
10 G3 H 460 790 1250
11 G3 I 123 300 177
12 G4 J 343 878 1221
13 G4 K 343 878 1221
14 G4 L 320 862 1182
I would like for each Groups to keep only one interval representant (an interval meaning that df.start and df.end overlaps between rows), I explain:
For exemple, in the G1 there are 2 intervals groups :
Interval 1 (with min = 451 and max = 969):
Name start end sum
A 451 954 1405
B 451 951 1402
C 451 969 1420
D 463 870 1333
E 463 888 1351
Then I take the biggest df.sum (here 1420)
and
Interval2 (with min = 229 and max = 450)
Name start end sum
X 230 450 680
Z 229 450 681
Then I take the biggest df.sum (here 681)
If I do that for the whole dataframe I get:
Groups Name start end sum
3 G1 C 451 969 1420
7 G1 Z 229 450 681
9 G2 G 118 842 960
10 G3 H 460 790 1250
11 G3 I 123 300 177
12 G4 J 343 878 1221
Does someone have an idea?
Here are the data in dictionary format :
{'Groups Name start end sum': {0: 'G1 A 451 954 1405', 1: 'G1 B 451 951 1402', 2: 'G1 C 451 969 1420', 3: 'G1 D 463 870 1333', 4: 'G1 E 463 888 1351', 5: 'G1 X 230 450 680', 6: 'G1 Z 229 450 681', 7: 'G2 F 119 841 960', 8: 'G2 G 118 842 960', 9: 'G3 H 460 790 1250', 10: 'G3 I 123 300 177', 11: 'G4 J 343 878 1221', 12: 'G4 K 343 878 1221', 13: 'G4 L 320 862 1182'}}
You can group the dataframe by both the Groups column and a new column representing the overlapping ranges. What you can do is first sort the dataframe by Groups and end (as well as start if there are duplicates in end).
df = df.sort_values(['Groups', 'end', 'start'])
Now, since we know the order of the rows, we can create the additional column mentioned above, let's call it overlap:
c1 = df['Groups'].shift() != df['Groups']
c2 = df['end'].shift() - df['start'] < 0
df['overlap'] = (c1 | c2).cumsum()
Current dataframe:
Groups Name start end sum overlap
7 G1 Z 229 450 681 1
6 G1 X 230 450 680 1
4 G1 D 463 870 1333 2
5 G1 E 463 888 1351 2
2 G1 B 451 951 1402 2
1 G1 A 451 954 1405 2
3 G1 C 451 969 1420 2
8 G2 F 119 841 960 3
9 G2 G 118 842 960 3
11 G3 I 123 300 177 4
10 G3 H 460 790 1250 5
14 G4 L 320 862 1182 6
12 G4 J 343 878 1221 6
13 G4 K 343 878 1221 6
Finally, we get the row with the maximum sum in each group using groupby.
df.sort_values(['sum'], ascending=False).groupby('overlap').first()
Result:
Groups Name start end sum
overlap
1 G1 Z 229 450 681
2 G1 C 451 969 1420
3 G2 F 119 841 960
4 G3 I 123 300 177
5 G3 H 460 790 1250
6 G4 J 343 878 1221
For the selection of rows where the sum is the same in a group, it is possible to use an additional column when sorting by the sum for second-level ordering. For example, if the order should be based on the original ordering of the rows a new index column can be added at the start by using reset_index().
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With