I have a pandas DataFrame that looks like this:
qseqid sseqid qstart qend
2 1 125 345
4 1 150 320
3 2 150 450
6 2 25 300
8 2 50 500
I would like to remove rows based on the values in other rows, using this criterion: a row (r1) must be removed if another row (r2) exists with the same sseqid and r1[qstart] > r2[qstart] and r1[qend] < r2[qend].
Is this possible with pandas?
pandas.DataFrame.drop() removes rows by index label, so rows that meet a condition can be dropped by passing it the labels selected by that condition. This is a common data-cleansing step when a column value matches a static value or the value of another column.
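import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

def remove_rows(df):
    # Pair every row with every row that has the same sseqid;
    # reset_index() keeps the original label of the left-hand row in 'index'.
    merged = pd.merge(df.reset_index(), df, on='sseqid')
    # True where the left row (_x) lies strictly inside the right row (_y).
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    # Keep only the labels that never appear as a contained row.
    df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
    result = df.loc[df_mask]
    return result

result = remove_rows(df)
print(result)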
yields
   qend  qseqid  qstart  sseqid
0   345       2     125       1
3   300       6      25       2
4   500       8      50       2
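As a cross-check on that output, the rule from the question can also be applied directly, one row at a time. The following is only a minimal sketch for small inputs: the helper names is_contained and remove_rows_bruteforce are made up for this illustration, and it assumes, like the merge-based version, that df has a unique and sortable index.

```python
import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

def is_contained(r1, group):
    # Hypothetical helper: True if some row r2 in the same sseqid group has
    # r2[qstart] < r1[qstart] and r2[qend] > r1[qend].
    return ((group['qstart'] < r1['qstart'])
            & (group['qend'] > r1['qend'])).any()

def remove_rows_bruteforce(df):
    # Hypothetical helper: apply the rule literally, group by group.
    keep = []
    for _, group in df.groupby('sseqid'):
        for label, r1 in group.iterrows():
            if not is_contained(r1, group):
                keep.append(label)
    # Restore the original row order (assumes sortable, unique labels).
    return df.loc[sorted(keep)]

kept = remove_rows_bruteforce(df)
print(kept)
```

On the sample data this keeps the rows labelled 0, 3 and 4, the same rows as above. It loops in Python, so it is only meant for checking the logic, not for large frames.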
The idea is to use pd.merge to form a DataFrame with every pairing of rows that share the same sseqid:
In [78]: pd.merge(df.reset_index(), df, on='sseqid')
Out[78]:
    index  qend_x  qseqid_x  qstart_x  sseqid  qend_y  qseqid_y  qstart_y
0       0     345         2       125       1     345         2       125
1       0     345         2       125       1     320         4       150
2       1     320         4       150       1     345         2       125
3       1     320         4       150       1     320         4       150
4       2     450         3       150       2     450         3       150
5       2     450         3       150       2     300         6        25
6       2     450         3       150       2     500         8        50
7       3     300         6        25       2     450         3       150
8       3     300         6        25       2     300         6        25
9       3     300         6        25       2     500         8        50
10      4     500         8        50       2     450         3       150
11      4     500         8        50       2     300         6        25
12      4     500         8        50       2     500         8        50
Each row of merged contains data from two rows of df. You can then compare every pair of rows using
mask = ((merged['qstart_x'] > merged['qstart_y'])
        & (merged['qend_x'] < merged['qend_y']))
and find the labels in df.index that do not match this condition:
df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
and select those rows:
result = df.loc[df_mask]
Note that this assumes df has a unique index.
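If the index is not unique, one possible workaround is to track rows by position instead of by label. This is a sketch rather than part of the original answer: the function name remove_rows_positional and the temporary column _pos are made up for this illustration, and it assumes df has no existing column called _pos.

```python
import numpy as np
import pandas as pd

def remove_rows_positional(df):
    # Tag each row with its position so duplicate index labels do not matter.
    # '_pos' is a hypothetical temporary column name.
    tagged = df.assign(_pos=np.arange(len(df)))
    merged = pd.merge(tagged, tagged, on='sseqid')
    # True where the left row (_x) lies strictly inside the right row (_y).
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    drop_pos = merged.loc[mask, '_pos_x'].unique()
    # Boolean array over positions: True for rows that are never contained.
    keep = ~np.isin(np.arange(len(df)), drop_pos)
    return df.iloc[keep]

# Same data as above, but with deliberately repeated index labels.
df_dup = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                       'qseqid': [2, 4, 3, 6, 8],
                       'qstart': [125, 150, 150, 25, 50],
                       'sseqid': [1, 1, 2, 2, 2]},
                      index=['a', 'a', 'b', 'b', 'c'])

result_dup = remove_rows_positional(df_dup)
print(result_dup)
```

Selecting with iloc and a positional mask keeps the original labels untouched, so the result still carries the duplicated index of the input.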