Compare Python Pandas DataFrames for matching rows

Tags:

I have this DataFrame (df1) in Pandas:

df1 = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD')) print df1         A         B         C         D 0.860379  0.726956  0.394529  0.833217 0.014180  0.813828  0.559891  0.339647 0.782838  0.698993  0.551252  0.361034 0.833370  0.982056  0.741821  0.006864 0.855955  0.546562  0.270425  0.136006 0.491538  0.445024  0.971603  0.690001 0.911696  0.065338  0.796946  0.853456 0.744923  0.545661  0.492739  0.337628 0.576235  0.219831  0.946772  0.752403 0.164873  0.454862  0.745890  0.437729

I would like to check if any row (all columns) from another dataframe (df2) are present in df1. Here is df2:

df2 = df1.ix[4:8] df2.reset_index(drop=True,inplace=True) df2.loc[-1] = [2, 3, 4, 5] df2.loc[-2] = [14, 15, 16, 17] df2.reset_index(drop=True,inplace=True) print df2             A         B         C         D     0.855955  0.546562  0.270425  0.136006     0.491538  0.445024  0.971603  0.690001     0.911696  0.065338  0.796946  0.853456     0.744923  0.545661  0.492739  0.337628     0.576235  0.219831  0.946772  0.752403     2.000000  3.000000  4.000000  5.000000    14.000000 15.000000 16.000000 17.000000

I tried using df.lookup to search for one row at a time. I did it this way:

list1 = df2.ix[0].tolist() cols = df1.columns.tolist() print df1.lookup(list1, cols)

but I got this error message:

  File "C:\Users\test.py", line 19, in <module>     print df1.lookup(list1, cols)   File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2217, in lookup     raise KeyError('One or more row labels was not found') KeyError: 'One or more row labels was not found'

I also tried .all() using:

print (df2 == df1).all(1).any()

but I got this error message:

  File "C:\Users\test.py", line 12, in <module>     print (df2 == df1).all(1).any()   File "C:\python27\lib\site-packages\pandas\core\ops.py", line 884, in f     return self._compare_frame(other, func, str_rep)   File "C:\python27\lib\site-packages\pandas\core\frame.py", line 3010, in _compare_frame     raise ValueError('Can only compare identically-labeled ' ValueError: Can only compare identically-labeled DataFrame objects

I also tried isin() like this:

print df2.isin(df1)

but I got False everywhere, which is not correct:

    A      B      C      D False  False  False  False False  False  False  False False  False  False  False False  False  False  False False  False  False  False False  False  False  False False  False  False  False False  False  False  False False  False  False  False False  False  False  False

Is it possible to search for a set of rows in a DataFrame, by comparing it to another dataframe's rows?

EDIT: Is is possible to drop df2 rows if those rows are also present in df1?

246

asked Apr 06 '15 01:04

edesz

2 Answers

One possible solution to your problem would be to use merge. Checking if any row (all columns) from another dataframe (df2) are present in df1 is equivalent to determining the intersection of the the two dataframes. This can be accomplished using the following function:

pd.merge(df1, df2, on=['A', 'B', 'C', 'D'], how='inner')

For example, if df1 was

    A           B            C          D 0   0.403846    0.312230    0.209882    0.397923 1   0.934957    0.731730    0.484712    0.734747 2   0.588245    0.961589    0.910292    0.382072 3   0.534226    0.276908    0.323282    0.629398 4   0.259533    0.277465    0.043652    0.925743 5   0.667415    0.051182    0.928655    0.737673 6   0.217923    0.665446    0.224268    0.772592 7   0.023578    0.561884    0.615515    0.362084 8   0.346373    0.375366    0.083003    0.663622 9   0.352584    0.103263    0.661686    0.246862

and df2 was defined as:

     A          B            C           D 0   0.259533    0.277465    0.043652    0.925743 1   0.667415    0.051182    0.928655    0.737673 2   0.217923    0.665446    0.224268    0.772592 3   0.023578    0.561884    0.615515    0.362084 4   0.346373    0.375366    0.083003    0.663622 5   2.000000    3.000000    4.000000    5.000000 6   14.000000   15.000000   16.000000   17.000000

The function pd.merge(df1, df2, on=['A', 'B', 'C', 'D'], how='inner') produces:

     A           B           C           D 0   0.259533    0.277465    0.043652    0.925743 1   0.667415    0.051182    0.928655    0.737673 2   0.217923    0.665446    0.224268    0.772592 3   0.023578    0.561884    0.615515    0.362084 4   0.346373    0.375366    0.083003    0.663622

The results are all of the rows (all columns) that are both in df1 and df2.

We can also modify this example if the columns are not the same in df1 and df2 and just compare the row values that are the same for a subset of the columns. If we modify the original example:

df1 = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD')) df2 = df1.ix[4:8] df2.reset_index(drop=True,inplace=True) df2.loc[-1] = [2, 3, 4, 5] df2.loc[-2] = [14, 15, 16, 17] df2.reset_index(drop=True,inplace=True) df2 = df2[['A', 'B', 'C']] # df2 has only columns A B C

Then we can look at the common columns using common_cols = list(set(df1.columns) & set(df2.columns)) between the two dataframes then merge:

pd.merge(df1, df2, on=common_cols, how='inner')

EDIT: New question (comments), having identified the rows from df2 that were also present in the first dataframe (df1), is it possible to take the result of the pd.merge() and to then drop the rows from df2 that are also present in df1

I do not know of a straightforward way to accomplish the task of dropping the rows from df2 that are also present in df1. That said, you could use the following:

ds1 = set(tuple(line) for line in df1.values) ds2 = set(tuple(line) for line in df2.values) df = pd.DataFrame(list(ds2.difference(ds1)), columns=df2.columns)

There probably exists a better way to accomplish that task but i am unaware of such a method / function.

EDIT 2: How to drop the rows from df2 that are also present in df1 as shown in @WR answer.

The method provided df2[~df2['A'].isin(df12['A'])] does not account for all types of situations. Consider the following DataFrames:

df1:

   A  B  C  D 0  6  4  1  6 1  7  6  6  8 2  1  6  2  7 3  8  0  4  1 4  1  0  2  3 5  8  4  7  5 6  4  7  1  1 7  3  7  3  4 8  5  2  8  8 9  3  2  8  4

df2:

   A  B  C  D 0  1  0  2  3 1  8  4  7  5 2  4  7  1  1 3  3  7  3  4 4  5  2  8  8 5  1  1  1  1 6  2  2  2  2

df12:

   A  B  C  D 0  1  0  2  3 1  8  4  7  5 2  4  7  1  1 3  3  7  3  4 4  5  2  8  8

Using the above DataFrames with the goal of dropping rows from df2 that are also present in df1 would result in the following:

   A  B  C  D 0  1  1  1  1 1  2  2  2  2

Rows (1, 1, 1, 1) and (2, 2, 2, 2) are in df2 and not in df1. Unfortunately, using the provided method (df2[~df2['A'].isin(df12['A'])]) results in:

   A  B  C  D 6  2  2  2  2

This occurs because the value of 1 in column A is found in both the intersection DataFrame (i.e. (1, 0, 2, 3)) and df2 and thus removes both (1, 0, 2, 3) and (1, 1, 1, 1). This is unintended since the row (1, 1, 1, 1) is not in df1 and should not be removed.

I think the following will provide a solution. It creates a dummy column that is later used to subset the DataFrame to the desired results:

df12['key'] = 'x' temp_df = pd.merge(df2, df12, on=df2.columns.tolist(), how='left') temp_df[temp_df['key'].isnull()].drop('key', axis=1)

157

answered Oct 10 '22 16:10

Andrew

@Andrew: I believe I found a way to drop the rows of one dataframe that are already present in another (i.e. to answer my EDIT) without using loops - let me know if you disagree and/or if my OP + EDIT did not clearly state this:

THIS WORKS

The columns for both dataframes are always the same - A, B, C and D. With this in mind, based heavily on Andrew's approach, here is how to drop the rows from df2 that are also present in df1:

common_cols = df1.columns.tolist()                         #generate list of column names df12 = pd.merge(df1, df2, on=common_cols, how='inner')     #extract common rows with merge df2 = df2[~df2['A'].isin(df12['A'])]

Line 3 does the following:

Extract only rows from df2 that do not match rows in df1:
In order for 2 rows to be different, ANY one column of one row must
necessarily be different that the corresponding column in another row.
Here, I picked column A to make this comparison - it is
possible to use any of the column names, but not ALL of the
column names.

NOTE: this method is essentially the equivalent of the SQL NOT IN().

answered Oct 10 '22 16:10

edesz

Related questions
                            
                                Is it possible to unpack a tuple in Python without creating unwanted variables?
                            
                                Easiest way to develop simple GUI in Python [closed]
                            
                                Python - Cleanest way to override __init__ where an optional kwarg must be used after the super() call?
                            
                                Python pandas empty correlation matrix
                            
                                What is a "scalar" in numpy?
                            
                                Python os.walk + follow symlinks
                            
                                Implement list-like index access in Python
                            
                                Trying to use open(filename, 'w' ) gives IOError: [Errno 2] No such file or directory if directory doesn't exist
                            
                                Inserting a degree symbol into python plot
                            
                                Python: Exporting environment variables in subprocess.Popen(..)
                            
                                How to run all PyTest assertions even if some of them fail?
                            
                                Getting data from ctypes array into numpy
                            
                                Setupterm could not find terminal, in Python program using curses
                            
                                Hyperparameter optimization for Pytorch model [closed]
                            
                                How to solve the "Mastermind" guessing game?
                            
                                How to make Django QuerySet bulk delete() more efficient
                            
                                Removing Duplicates From Dictionary
                            
                                What are good practices for avoiding crashes / hangs in PyQt?
                            
                                how to release used memory immediately in python list?
                            
                                What is the Python way of chaining maps and filters?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Compare Python Pandas DataFrames for matching rows

Tags:

python

matching

pandas

rows

edesz

People also ask

2 Answers

Andrew

edesz

Recent Activity

Donate For Us