Pandas: Delete rows based on other rows

Q: How do I delete rows in a pandas DataFrame based on a condition?

Use the pandas.DataFrame.drop() method to delete rows that match one or more conditions.

Q: How do you delete a row from a DataFrame based on multiple column values?

Use the drop() method to delete rows based on a column value. As part of data cleansing, you may need to drop rows when a column value matches a static value or the value of another column.

Tags: python, pandas, dataframe

I have a pandas DataFrame that looks like this:

qseqid  sseqid  qstart    qend
2         1     125       345
4         1     150       320
3         2     150       450
6         2     25        300
8         2     50        500

I would like to remove rows based on the values of other rows, using this criterion: a row (r1) must be removed if another row (r2) exists with the same sseqid and r1[qstart] > r2[qstart] and r1[qend] < r2[qend].

Is this possible with pandas?

asked Aug 30 '16 09:08

jsgounot

1 Answer

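import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

def remove_rows(df):
    # Pair every row with every row sharing its sseqid; reset_index()
    # keeps the original label of the left-hand row in an 'index' column.
    merged = pd.merge(df.reset_index(), df, on='sseqid')
    # True where the left row (_x) lies strictly inside the right row (_y).
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    # Keep only the rows whose label was never flagged.
    df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
    result = df.loc[df_mask]
    return result

result = remove_rows(df)
print(result)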

yields

   qend  qseqid  qstart  sseqid
0   345       2     125       1
3   300       6      25       2
4   500       8      50       2

The idea is to use pd.merge to form a DataFrame with every pairing of rows with the same sseqid:

In [78]: pd.merge(df.reset_index(), df, on='sseqid')
Out[78]: 
    index  qend_x  qseqid_x  qstart_x  sseqid  qend_y  qseqid_y  qstart_y
0       0     345         2       125       1     345         2       125
1       0     345         2       125       1     320         4       150
2       1     320         4       150       1     345         2       125
3       1     320         4       150       1     320         4       150
4       2     450         3       150       2     450         3       150
5       2     450         3       150       2     300         6        25
6       2     450         3       150       2     500         8        50
7       3     300         6        25       2     450         3       150
8       3     300         6        25       2     300         6        25
9       3     300         6        25       2     500         8        50
10      4     500         8        50       2     450         3       150
11      4     500         8        50       2     300         6        25
12      4     500         8        50       2     500         8        50

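Each row of merged contains data from two rows of df: the _x columns come from the candidate row (r1) and the _y columns from the row it is compared against (r2). You can then compare each pair of rows using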

mask = ((merged['qstart_x'] > merged['qstart_y']) 
        & (merged['qend_x'] < merged['qend_y']))

and build a boolean mask over df.index that is True only for labels never flagged by this condition:

df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)

and select those rows:

result = df.loc[df_mask]
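As a sanity check, the steps above can be run on the sample data to see which labels get flagged, and the outcome compared against a plain double loop that applies the question's criterion directly. This is only a sketch: brute_force_keep is a hypothetical helper written for this comparison, not part of pandas or of the answer's function, and it is quadratic in the number of rows.

```python
import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

merged = pd.merge(df.reset_index(), df, on='sseqid')
mask = ((merged['qstart_x'] > merged['qstart_y'])
        & (merged['qend_x'] < merged['qend_y']))

# Labels of rows strictly contained in another row with the same sseqid.
flagged = sorted(merged.loc[mask, 'index'].unique().tolist())
print(flagged)  # [1, 2]

def brute_force_keep(df):
    """Hypothetical reference check: return the labels of rows to keep."""
    keep = []
    for label, r1 in df.iterrows():
        contained = any(
            r2['sseqid'] == r1['sseqid']
            and r1['qstart'] > r2['qstart']
            and r1['qend'] < r2['qend']
            for _, r2 in df.iterrows()
        )
        if not contained:
            keep.append(label)
    return keep

kept = brute_force_keep(df)
print(kept)  # [0, 3, 4]
```

Both approaches agree on the sample: labels 1 and 2 are removed, and 0, 3 and 4 remain.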

Note that this assumes df has a unique index: isin matches by label, so any row sharing a label with a flagged row would be removed as well.
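If the index is not unique, one possible workaround is to track row positions instead of labels. The sketch below assumes that approach; remove_rows_positional and the temporary _pos column are names made up for this illustration, and the duplicated labels are invented test data.

```python
import numpy as np
import pandas as pd

def remove_rows_positional(df):
    """Hypothetical variant of remove_rows that tolerates duplicate labels."""
    positions = np.arange(len(df))
    # Tag each left-hand row with its position rather than its label.
    left = df.assign(_pos=positions)
    merged = pd.merge(left, df, on='sseqid')
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    drop_pos = merged.loc[mask, '_pos'].unique()
    keep = ~np.isin(positions, drop_pos)
    # Select by position so duplicated labels cannot cause extra removals.
    return df.iloc[np.flatnonzero(keep)]

dup = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                    'qseqid': [2, 4, 3, 6, 8],
                    'qstart': [125, 150, 150, 25, 50],
                    'sseqid': [1, 1, 2, 2, 2]},
                   index=['a', 'a', 'b', 'b', 'c'])  # deliberately non-unique

result = remove_rows_positional(dup)
print(result)
```

With these labels, the label-based mask would flag 'a' and 'b' and so discard four rows, whereas the positional version keeps the rows with qseqid 2, 6 and 8.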

answered Oct 09 '22 21:10

unutbu
