
Find same data in two DataFrames of different shapes

Tags:

python

pandas

I have two Pandas DataFrames that I would like to compare. For example

    a    b    c
A   na   na  na
B   na   1    1
C   na   1    na

and

    a    b    c
A   1    na   1
B   na   na   na
C   na   1    na
D   na   1    na

I want to find the index-column coordinates for any values that are shared, in this case

    b
C   1

Is this possible?

jds asked Nov 10 '15 22:11




1 Answer

If you pass the keys parameter to concat, the columns of the resulting DataFrame become a MultiIndex that keeps track of which original DataFrame each column came from:

In [1]: c=pd.concat([df,df2],axis=1,keys=['df1','df2'])
        c

Out[1]:
   df1           df2
     a    b    c   a   b   c
A   na   na   na   1  na   1
B   na    1    1  na  na  na
C   na    1   na  na   1  na
D  NaN  NaN  NaN  na   1  na
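For anyone who wants to reproduce this step, here is a minimal sketch. It assumes the 'na' cells in the question stand for NaN and rebuilds the two frames by hand; the names df and df2 follow the answer, and the literal values are a reconstruction rather than the asker's actual data.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's frames ('na' treated as NaN).
df = pd.DataFrame(
    {"a": [np.nan, np.nan, np.nan],
     "b": [np.nan, 1, 1],
     "c": [np.nan, 1, np.nan]},
    index=list("ABC"),
)
df2 = pd.DataFrame(
    {"a": [1, np.nan, np.nan, np.nan],
     "b": [np.nan, np.nan, 1, 1],
     "c": [1, np.nan, np.nan, np.nan]},
    index=list("ABCD"),
)

# keys= adds an outer column level naming the source frame.
# The row index becomes the union of both indexes, so row D is
# filled with NaN on the df1 side.
c = pd.concat([df, df2], axis=1, keys=["df1", "df2"])
print(c)
```

Selecting c["df1"] or c["df2"] then gives back each frame reindexed to the shared row labels.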

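Since both halves now share the same index and columns, you can compare them element-wise with == and use the result as a mask to return all matching values. NaN never compares equal to NaN, so cells that are missing in both frames do not count as matches: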

In [171]: m=c.df1[c.df1==c.df2];m
Out[171]:
    a   b   c
A NaN NaN NaN
B NaN NaN NaN
C NaN   1 NaN
D NaN NaN NaN
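If the goal is the index-column coordinates themselves rather than a masked frame, one possible follow-up is to read the True positions out of the boolean mask. This is only a sketch under the same assumption as before ('na' means NaN, frames rebuilt by hand); the variable names mask and coords are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's frames ('na' treated as NaN).
df = pd.DataFrame(
    {"a": [np.nan, np.nan, np.nan],
     "b": [np.nan, 1, 1],
     "c": [np.nan, 1, np.nan]},
    index=list("ABC"),
)
df2 = pd.DataFrame(
    {"a": [1, np.nan, np.nan, np.nan],
     "b": [np.nan, np.nan, 1, 1],
     "c": [1, np.nan, np.nan, np.nan]},
    index=list("ABCD"),
)
c = pd.concat([df, df2], axis=1, keys=["df1", "df2"])

# Element-wise equality; NaN == NaN is False, so only real matches survive.
mask = c["df1"] == c["df2"]

# Positions of the True cells, translated back to row/column labels.
rows, cols = np.where(mask.to_numpy())
coords = [(mask.index[int(r)], mask.columns[int(k)]) for r, k in zip(rows, cols)]
print(coords)  # [('C', 'b')]
```

Unlike the sparse-matrix route, this keeps the original labels.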

If your 'na' values are actually zeros, you could use a sparse matrix to reduce this to the coordinates of the matching values (you'll lose your index and column names, though):

import scipy.sparse as sp
print(sp.coo_matrix(m.where(m.notnull(),0)))
  (2, 1)    1.0
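The positions reported by the sparse matrix can be mapped back to labels through its row and col attributes. The sketch below assumes the same hand-built frames as the question (with 'na' as NaN, filled with 0 before conversion); labelled is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Assumed reconstruction of the question's frames ('na' treated as NaN).
df = pd.DataFrame(
    {"a": [np.nan, np.nan, np.nan],
     "b": [np.nan, 1, 1],
     "c": [np.nan, 1, np.nan]},
    index=list("ABC"),
)
df2 = pd.DataFrame(
    {"a": [1, np.nan, np.nan, np.nan],
     "b": [np.nan, np.nan, 1, 1],
     "c": [1, np.nan, np.nan, np.nan]},
    index=list("ABCD"),
)
c = pd.concat([df, df2], axis=1, keys=["df1", "df2"])
m = c["df1"][c["df1"] == c["df2"]]

# COO format stores only the non-zero cells as (row, col, value) triples.
coo = sp.coo_matrix(m.fillna(0).to_numpy())

# Translate integer positions back into the frame's own labels.
labelled = [
    (m.index[int(i)], m.columns[int(j)], float(v))
    for i, j, v in zip(coo.row, coo.col, coo.data)
]
print(labelled)  # [('C', 'b', 1.0)]
```

Note that this approach cannot tell a matching value of 0 apart from "no match", since zeros are exactly what the sparse format leaves out.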
maxymoo answered Oct 18 '22 08:10