
How to filter out rows of one python pandas dataframe from another dataframe by comparing columns?

I'm trying to exclude rows from one dataframe that also occur in another dataframe:

import pandas

df = pandas.DataFrame({'A': ['Chr1', 'Chr1', 'Chr1','Chr1', 'Chr1', 'Chr1','Chr2','Chr2'], 'B': [10,20,30,40,50,60,15,20]})

errors = pandas.DataFrame({'A': ['Chr1', 'Chr1'], 'B': [20,50]})

As a result, the rows in df that are equal to rows in errors should be left out:

df:
'A'    'B'
Chr1    10
Chr1    30
Chr1    40
Chr1    60
Chr2    15
Chr2    20

It doesn't seem to work with df.merge, and I don't want to iterate over all rows, since the dataframes get pretty large.

Best,

David

asked Jul 10 '14 12:07 by David Ries


2 Answers

Add an extra column to errors

errors['temp'] = 1

Merge the two dataframes

merged_df = pandas.merge(df,errors,how='outer')

Now keep only those rows which have 'temp' as NaN, i.e. the rows of df that found no match in errors

merged_df = merged_df[ merged_df['temp'] != 1 ]
del merged_df['temp']

print(merged_df)

      A   B
 0  Chr1  10
 2  Chr1  30
 3  Chr1  40
 5  Chr1  60
 6  Chr2  15
 7  Chr2  20
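
The same idea can be sketched without the helper column by using the indicator option of merge. This is a variant, not the code from the answer above: it assumes a pandas version that supports indicator=True, and the names merged and filtered are only illustrative.

```python
import pandas

df = pandas.DataFrame({
    'A': ['Chr1', 'Chr1', 'Chr1', 'Chr1', 'Chr1', 'Chr1', 'Chr2', 'Chr2'],
    'B': [10, 20, 30, 40, 50, 60, 15, 20],
})
errors = pandas.DataFrame({'A': ['Chr1', 'Chr1'], 'B': [20, 50]})

# Left-merge on both columns. indicator=True adds a '_merge' column that
# records whether each row was found in df only or in both frames.
# drop_duplicates() keeps repeated error rows from duplicating rows of df.
merged = df.merge(errors.drop_duplicates(), on=['A', 'B'],
                  how='left', indicator=True)

# 'left_only' marks the rows of df that have no matching row in errors.
filtered = merged.loc[merged['_merge'] == 'left_only', ['A', 'B']]
print(filtered)
```

A left merge keeps the row order of df, so the remaining rows come out in their original order.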
answered Nov 14 '22 07:11 by Ankush Shah


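For two columns you can do the following. Note that it tests each column separately: a row is dropped whenever its A value occurs anywhere in errors['A'] and its B value occurs anywhere in errors['B'], even if that exact (A, B) pair is not a row of errors. It gives the expected result for the example data:

print(df[ ~df['A'].isin(errors['A']) | ~df['B'].isin(errors['B']) ])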
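
To compare whole (A, B) pairs rather than each column on its own, one option is to build a MultiIndex key per row and test membership on that. This is a minimal sketch, not the answerer's code; drop_matching_rows and other_errors are hypothetical names introduced for illustration.

```python
import pandas

def drop_matching_rows(df, errors, cols):
    """Return the rows of df whose values in cols do not form a row of errors."""
    # One tuple-like key per row, so all listed columns are compared together.
    df_keys = pandas.MultiIndex.from_arrays([df[c] for c in cols])
    error_keys = pandas.MultiIndex.from_arrays([errors[c] for c in cols])
    return df[~df_keys.isin(error_keys)]

df = pandas.DataFrame({
    'A': ['Chr1', 'Chr1', 'Chr1', 'Chr1', 'Chr1', 'Chr1', 'Chr2', 'Chr2'],
    'B': [10, 20, 30, 40, 50, 60, 15, 20],
})
errors = pandas.DataFrame({'A': ['Chr1', 'Chr1'], 'B': [20, 50]})

filtered = drop_matching_rows(df, errors, ['A', 'B'])
print(filtered)

# Illustrative second case: (Chr1, 50) and (Chr2, 20) are not error rows here,
# so a row-wise comparison keeps them and drops only (Chr1, 20).
other_errors = pandas.DataFrame({'A': ['Chr1', 'Chr2'], 'B': [20, 50]})
filtered_other = drop_matching_rows(df, other_errors, ['A', 'B'])
print(filtered_other)
```

The result keeps the original index of df, so the rows that remain can still be traced back to their positions in the input.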
answered Nov 14 '22 07:11 by furas