
Pandas: Find rows which don't exist in another DataFrame by multiple columns

This is the same as the question "python pandas: how to find rows in one dataframe but not in another?", but with multiple columns.

This is the setup:

    import pandas as pd

    df = pd.DataFrame(dict(
        col1=[0,1,1,2],
        col2=['a','b','c','b'],
        extra_col=['this','is','just','something']
    ))

    other = pd.DataFrame(dict(
        col1=[1,2],
        col2=['b','c']
    ))

Now I want to select the rows from df that don't exist in other, matching on col1 and col2.

In SQL I would do:

    select * from df
    where not exists (
        select * from other o
        where df.col1 = o.col1 and
        df.col2 = o.col2
    )

And in Pandas I can do something like this, but it feels very ugly. Part of the ugliness could be avoided if df had an id column, but one is not always available.

    key_col = ['col1','col2']
    df_with_idx = df.reset_index()
    common = pd.merge(df_with_idx,other,on=key_col)['index']
    mask = df_with_idx['index'].isin(common)

    desired_result = df_with_idx[~mask].drop('index',axis=1)

So maybe there is some more elegant way?

Pekka asked Sep 18 '15 13:09



2 Answers

Since pandas 0.17.0, merge accepts an indicator parameter that tells you whether each row is present only in the left frame, only in the right frame, or in both:

    In [5]:
    merged = df.merge(other, how='left', indicator=True)
    merged

    Out[5]:
       col1 col2  extra_col     _merge
    0     0    a       this  left_only
    1     1    b         is       both
    2     1    c       just  left_only
    3     2    b  something  left_only

    In [6]:
    merged[merged['_merge']=='left_only']

    Out[6]:
       col1 col2  extra_col     _merge
    0     0    a       this  left_only
    2     1    c       just  left_only
    3     2    b  something  left_only

So you can now filter the merged df by selecting only the 'left_only' rows.
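For reuse, the indicator approach can be wrapped in a small helper. This is only a sketch: the name anti_join is made up for illustration, and it assumes df has no column already called _merge. It also restricts the right-hand frame to its de-duplicated key columns, which the merge above does not need for this data but which keeps rows from being multiplied when other contains repeated keys or extra columns.

```python
import pandas as pd

def anti_join(left, right, on):
    """Return the rows of `left` whose key columns `on` have no match in `right`.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Keep only the key columns and drop repeated keys so the merge
    # can neither duplicate rows of `left` nor add extra columns.
    keys = right[on].drop_duplicates()
    merged = left.merge(keys, on=on, how='left', indicator=True)
    # 'left_only' marks rows whose key was not found in `right`.
    return merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

df = pd.DataFrame(dict(
    col1=[0, 1, 1, 2],
    col2=['a', 'b', 'c', 'b'],
    extra_col=['this', 'is', 'just', 'something'],
))
other = pd.DataFrame(dict(col1=[1, 2], col2=['b', 'c']))

result = anti_join(df, other, ['col1', 'col2'])
print(result)
```

Note that merge builds a fresh integer index, so if df carries a meaningful index it would have to be preserved separately (for example with reset_index before the merge).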

EdChum answered Oct 13 '22 00:10


Interesting

    cols = ['col1','col2']
    # get copies where the indices are the columns of interest
    df2 = df.set_index(cols)
    other2 = other.set_index(cols)
    # look for index overlap, then negate the mask with ~
    df[~df2.index.isin(other2.index)]

Returns:

       col1 col2  extra_col
    0     0    a       this
    2     1    c       just
    3     2    b  something

Seems a little bit more elegant...
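A variant of the same idea, sketched here under the assumption that pandas 0.24 or later is available (for MultiIndex.from_frame): build the key MultiIndex directly from the columns instead of making set_index copies. The custom index labels below are invented for the example, to show that the original index of df survives the filtering.

```python
import pandas as pd

df = pd.DataFrame(dict(
    col1=[0, 1, 1, 2],
    col2=['a', 'b', 'c', 'b'],
    extra_col=['this', 'is', 'just', 'something'],
), index=[10, 20, 30, 40])  # non-default index, chosen only for the demo
other = pd.DataFrame(dict(col1=[1, 2], col2=['b', 'c']))

cols = ['col1', 'col2']
# One (col1, col2) tuple per row, without copying the whole frame.
df_keys = pd.MultiIndex.from_frame(df[cols])
other_keys = pd.MultiIndex.from_frame(other[cols])

# Boolean array: True where the row's key also appears in `other`.
mask = df_keys.isin(other_keys)
result = df[~mask]
print(result)
```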

greg_data answered Oct 12 '22 22:10