I have a dataframe such as:
Groups Name start end sum
1 G1 A 451 954 1405
2 G1 B 451 951 1402
3 G1 C 451 969 1420
4 G1 D 463 870 1333
5 G1 E 463 888 1351
6 G1 X 230 450 680
7 G1 Z 229 450 681
8 G2 F 119 841 960
9 G2 G 118 842 960
10 G3 H 460 790 1250
11 G3 I 123 300 177
12 G4 J 343 878 1221
13 G4 K 343 878 1221
14 G4 L 320 862 1182
I would like for each Groups to keep only one interval representant (an interval meaning that df.start
and df.end
overlaps between rows), I explain:
For exemple, in the G1 there are 2 intervals groups :
Interval 1 (with min = 451
and max = 969
):
Name start end sum
A 451 954 1405
B 451 951 1402
C 451 969 1420
D 463 870 1333
E 463 888 1351
Then I take the biggest df.sum
(here 1420
)
and
Interval2 (with min = 229
and max = 450
)
Name start end sum
X 230 450 680
Z 229 450 681
Then I take the biggest df.sum
(here 681
)
If I do that for the whole dataframe I get:
Groups Name start end sum
3 G1 C 451 969 1420
7 G1 Z 229 450 681
9 G2 G 118 842 960
10 G3 H 460 790 1250
11 G3 I 123 300 177
12 G4 J 343 878 1221
Does someone have an idea?
Here are the data in dictionary format :
{'Groups Name start end sum': {0: 'G1 A 451 954 1405', 1: 'G1 B 451 951 1402', 2: 'G1 C 451 969 1420', 3: 'G1 D 463 870 1333', 4: 'G1 E 463 888 1351', 5: 'G1 X 230 450 680', 6: 'G1 Z 229 450 681', 7: 'G2 F 119 841 960', 8: 'G2 G 118 842 960', 9: 'G3 H 460 790 1250', 10: 'G3 I 123 300 177', 11: 'G4 J 343 878 1221', 12: 'G4 K 343 878 1221', 13: 'G4 L 320 862 1182'}}
You can group the dataframe by both the Groups
column and a new column representing the overlapping ranges. What you can do is first sort the dataframe by Groups
and end
(as well as start
if there are duplicates in end
).
df = df.sort_values(['Groups', 'end', 'start'])
Now, since we know the order of the rows, we can create the additional column mentioned above, let's call it overlap
:
c1 = df['Groups'].shift() != df['Groups']
c2 = df['end'].shift() - df['start'] < 0
df['overlap'] = (c1 | c2).cumsum()
Current dataframe:
Groups Name start end sum overlap
7 G1 Z 229 450 681 1
6 G1 X 230 450 680 1
4 G1 D 463 870 1333 2
5 G1 E 463 888 1351 2
2 G1 B 451 951 1402 2
1 G1 A 451 954 1405 2
3 G1 C 451 969 1420 2
8 G2 F 119 841 960 3
9 G2 G 118 842 960 3
11 G3 I 123 300 177 4
10 G3 H 460 790 1250 5
14 G4 L 320 862 1182 6
12 G4 J 343 878 1221 6
13 G4 K 343 878 1221 6
Finally, we get the row with the maximum sum
in each group using groupby
.
df.sort_values(['sum'], ascending=False).groupby('overlap').first()
Result:
Groups Name start end sum
overlap
1 G1 Z 229 450 681
2 G1 C 451 969 1420
3 G2 F 119 841 960
4 G3 I 123 300 177
5 G3 H 460 790 1250
6 G4 J 343 878 1221
For the selection of rows where the sum
is the same in a group, it is possible to use an additional column when sorting by the sum
for second-level ordering. For example, if the order should be based on the original ordering of the rows a new index
column can be added at the start by using reset_index()
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With