same as this python pandas: how to find rows in one dataframe but not in another? but with multiple columns
This is the setup:
import pandas as pd df = pd.DataFrame(dict( col1=[0,1,1,2], col2=['a','b','c','b'], extra_col=['this','is','just','something'] )) other = pd.DataFrame(dict( col1=[1,2], col2=['b','c'] ))
Now, I want to select the rows from df
which don't exist in other. I want to do the selection by col1
and col2
In SQL I would do:
select * from df where not exists ( select * from other o where df.col1 = o.col1 and df.col2 = o.col2 )
And in Pandas I can do something like this but it feels very ugly. Part of the ugliness could be avoided if df had id-column but it's not always available.
key_col = ['col1','col2'] df_with_idx = df.reset_index() common = pd.merge(df_with_idx,other,on=key_col)['index'] mask = df_with_idx['index'].isin(common) desired_result = df_with_idx[~mask].drop('index',axis=1)
So maybe there is some more elegant way?
The best way is to compare the row contents themselves and not the index or one/two columns and same code can be used for other filters like 'both' and 'right_only' as well to achieve similar results. For this syntax dataframes can have any number of columns and even different indices.
We use the concat() method to do so. In this method, we input DataFrames in a list as a parameter to it and remove duplicate rows from the resultant data frame using the drop_duplicates() method.
Now we will use dataframe. loc[] function to select the row values of the first data frame using the indexes of the second data frame. Pandas DataFrame. loc[] attribute access a group of rows and columns by label(s) or a boolean array in the given DataFrame.
Checking for missing values using isnull() In order to check null values in Pandas DataFrame, we use isnull() function this function return dataframe of Boolean values which are True for NaN values.
Since 0.17.0
there is a new indicator
param you can pass to merge
which will tell you whether the rows are only present in left, right or both:
In [5]: merged = df.merge(other, how='left', indicator=True) merged Out[5]: col1 col2 extra_col _merge 0 0 a this left_only 1 1 b is both 2 1 c just left_only 3 2 b something left_only In [6]: merged[merged['_merge']=='left_only'] Out[6]: col1 col2 extra_col _merge 0 0 a this left_only 2 1 c just left_only 3 2 b something left_only
So you can now filter the merged df by selecting only 'left_only'
rows
Interesting
cols = ['col1','col2'] #get copies where the indeces are the columns of interest df2 = df.set_index(cols) other2 = other.set_index(cols) #Look for index overlap, ~ df[~df2.index.isin(other2.index)]
Returns:
col1 col2 extra_col 0 0 a this 2 1 c just 3 2 b something
Seems a little bit more elegant...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With