Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas : Delete rows based on other rows

I have a pandas dataframe which looks like that :

qseqid  sseqid  qstart    qend
2         1     125       345
4         1     150       320
3         2     150       450
6         2     25        300
8         2     50        500

I would like to remove rows based on other rows values with these criterias : A row (r1) must be removed if another row (r2) exist with the same sseqid and r1[qstart] > r2[qstart] and r1[qend] < r2[qend].

Is this possible with pandas ?

like image 617
jsgounot Avatar asked Aug 30 '16 09:08

jsgounot


People also ask

How do I delete rows in pandas DataFrame based on condition?

Use pandas. DataFrame. drop() method to delete/remove rows with condition(s).

How do you delete a row from a DataFrame based on multiple column values?

Use drop() method to delete rows based on column value in pandas DataFrame, as part of the data cleansing, you would be required to drop rows from the DataFrame when a column value matches with a static value or on another column value.


1 Answers

df  = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
 'qseqid': [2, 4, 3, 6, 8],
 'qstart': [125, 150, 150, 25, 50],
 'sseqid': [1, 1, 2, 2, 2]})

def remove_rows(df):
    merged = pd.merge(df.reset_index(), df, on='sseqid')
    mask = ((merged['qstart_x'] > merged['qstart_y']) 
            & (merged['qend_x'] < merged['qend_y']))
    df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
    result = df.loc[df_mask]
    return result

result = remove_rows(df)
print(result)

yields

   qend  qseqid  qstart  sseqid
0   345       2     125       1
3   300       6      25       2
4   500       8      50       2

The idea is to use pd.merge to form a DataFrame with every pairing of rows with the same sseqid:

In [78]: pd.merge(df.reset_index(), df, on='sseqid')
Out[78]: 
    index  qend_x  qseqid_x  qstart_x  sseqid  qend_y  qseqid_y  qstart_y
0       0     345         2       125       1     345         2       125
1       0     345         2       125       1     320         4       150
2       1     320         4       150       1     345         2       125
3       1     320         4       150       1     320         4       150
4       2     450         3       150       2     450         3       150
5       2     450         3       150       2     300         6        25
6       2     450         3       150       2     500         8        50
7       3     300         6        25       2     450         3       150
8       3     300         6        25       2     300         6        25
9       3     300         6        25       2     500         8        50
10      4     500         8        50       2     450         3       150
11      4     500         8        50       2     300         6        25
12      4     500         8        50       2     500         8        50

Each row of merged contains data from two rows of df. You can then compare every two rows using

mask = ((merged['qstart_x'] > merged['qstart_y']) 
        & (merged['qend_x'] < merged['qend_y']))

and find the labels in df.index that do not match this condition:

df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)

and select those rows:

result = df.loc[df_mask]

Note that this assumes df has a unique index.

like image 89
unutbu Avatar answered Oct 09 '22 21:10

unutbu