Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Eliminate pandas dataframe rows with partial matches

I want to compare two pandas dataframes and find out the rows that are only in df1, by comparing the values in column A and B. I feel like I could somehow perform this by using merge but cannot figure out..

import pandas as pd

df1 = pd.DataFrame([[1,11, 111], [2,22, 222], [3, 33, 333]], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[1, 11]], columns=['A', 'B'])

df1

    A   B   C
0   1   11  111
1   2   22  222
2   3   33  333

df2

    A   B
0   1   11

Dataframe I want to see

    A   B   C
1   2   22  222
2   3   33  333
like image 355
E.K. Avatar asked Oct 19 '22 23:10

E.K.


2 Answers

Based on this approach:

import pandas as pd

df1 = pd.DataFrame([[1,11, 111], [2,22, 222], [3, 33, 333]], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[1, 11]], columns=['A', 'B'])

Concatenate the dataframes:

df = pd.concat([df1, df2])
df.reset_index(drop=True)

Groupby your desired comparison columns - in your case, A and B:

df_gpby = df.groupby(['A','B'])

Get the indexes for the groups with only one value - i.e. unique A,B pairs:

idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

Subset the concatenated dataframe by the indexes:

df.iloc[idx]

Results in:

    A   B   C
1   2   22  222
2   3   33  333
like image 190
vmg Avatar answered Oct 21 '22 15:10

vmg


While vmg's solution is neat, it requires you to know which columns you need to group by. A more generic approach is this:

First subtract one data frame from another:

In [46]: df3 = df1.subtract(df2)

In [47]: df3
Out[47]: 
    A   B   C
0   0   0 NaN
1 NaN NaN NaN
2 NaN NaN NaN

You see that the interesting rows, are those who don't exist in df2, so they are all NaN. Using numpy method you can find those rows:

In [50]: np.isnan(df3.iloc[0])
Out[50]: 
A    False
B    False
C     True
Name: 0, dtype: bool

In [51]: np.isnan(df3.iloc[1])
Out[51]: 
A    True
B    True
C    True
Name: 1, dtype: bool

Now, that you know how to locate those rows, you can do a crazy one liner:

In [52]: df1.iloc[[idx for idx, row in df3.iterrows() if 
                  all(np.isnan(df3.iloc[idx]))]]
Out[52]: 
   A   B    C
1  2  22  222
2  3  33  333

update, let's add a generic function

def substract_dataframe(df1, df2):
    for i in [df1, df2]:
        if not isinstance(i, pd.DataFrame):
            raise ValueError(("Wrong argument given! 
                             All arguments must be DataFrame instances"))
    df = df1.subtract(df2)
    return df1.iloc[[idx for idx, row in df.iterrows() if      
                    all(np.isnan(df.iloc[idx]))]]

testing ...

In [54]: substract_dataframe(df1, df2)
Out[54]: 
   A   B    C
1  2  22  222
2  3  33  333

In [55]: substract_dataframe(df1, 'sdf')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-6ce801e88ce4> in <module>()
----> 1 substract_dataframe(df1, 'sdf')

<ipython-input-53-e5d7db966311> in substract_dataframe(df1, df2)
      2     for i in [df1, df2]:
      3         if not isinstance(i, pd.DataFrame):
----> 4             raise ValueError("Wrong argument given! All arguments must be DataFrame instances")
      5     df = df1.subtract(df2)
      6     return df1.iloc[[idx for idx, row in df.iterrows() if all(np.isnan(df.iloc[idx]))]]

ValueError: Wrong argument given! All arguments must be DataFrame instances
like image 21
oz123 Avatar answered Oct 21 '22 15:10

oz123