"Anti-merge" in pandas (Python)

How can I pick out the difference between two columns of the same name in two dataframes? I have dataframe A with a column named X and dataframe B with a column named X. If I do pd.merge(A, B, on=['X']), I get the X values common to A and B, but how can I get the "non-common" ones?

Polly asked Jul 07 '16 09:07



2 Answers

If you change the merge type to how='outer' and pass indicator=True, the result gains a _merge column that tells you whether each value is in the left frame only, in both frames, or in the right frame only:

In [2]:
A = pd.DataFrame({'x':np.arange(5)})
B = pd.DataFrame({'x':np.arange(3,8)})
print(A)
print(B)

   x
0  0
1  1
2  2
3  3
4  4
   x
0  3
1  4
2  5
3  6
4  7

In [3]:
pd.merge(A,B, how='outer', indicator=True)

Out[3]:
     x      _merge
0  0.0   left_only
1  1.0   left_only
2  2.0   left_only
3  3.0        both
4  4.0        both
5  5.0  right_only
6  6.0  right_only
7  7.0  right_only

You can then filter the resultant merged df on the _merge col:

In [4]:
merged = pd.merge(A,B, how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']

Out[4]:
     x     _merge
0  0.0  left_only
1  1.0  left_only
2  2.0  left_only
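The merge-then-filter steps above can be wrapped in a small reusable function. This is only a sketch: the name left_anti_join is hypothetical (pandas has no built-in anti-join), and it assumes `on` is a list of column names present in both frames. It uses how='left' rather than how='outer', which should be enough when only the left-only rows are wanted.

```python
import numpy as np
import pandas as pd


def left_anti_join(left, right, on):
    """Return the rows of `left` whose key values in `on` do not appear in `right`.

    Hypothetical helper, not part of pandas. `on` is assumed to be a list
    of column names that exist in both frames.
    """
    # Only the key columns of `right` matter here; dropping duplicate keys
    # keeps the merge from repeating rows of `left`.
    right_keys = right[on].drop_duplicates()
    merged = left.merge(right_keys, on=on, how="left", indicator=True)
    # Keep the rows that found no partner in `right`, then drop the helper column.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


A = pd.DataFrame({"x": np.arange(5)})
B = pd.DataFrame({"x": np.arange(3, 8)})

result = left_anti_join(A, B, on=["x"])
print(result)
```

Swapping the arguments, left_anti_join(B, A, on=["x"]), gives the values that are only in B.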

You can also use isin and negate the mask to find values not in B:

In [5]:
A[~A['x'].isin(B['x'])]

Out[5]:
   x
0  0
1  1
2  2
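The isin mask only covers one direction (values in A that are missing from B). One possible way to get the "non-common" values from both sides is to apply the mask in each direction and stack the results. The variable names only_in_A, only_in_B and non_common below are illustrative, not from the original answer.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"x": np.arange(5)})
B = pd.DataFrame({"x": np.arange(3, 8)})

# Rows of A whose x value does not occur in B, and the mirror image for B.
only_in_A = A[~A["x"].isin(B["x"])]
only_in_B = B[~B["x"].isin(A["x"])]

# Stack both sides; ignore_index=True gives the result a fresh 0..n-1 index.
non_common = pd.concat([only_in_A, only_in_B], ignore_index=True)
print(non_common)
```

This compares a single column at a time; for keys spanning several columns, the merge with indicator=True is probably the simpler route.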
EdChum answered Oct 02 '22 12:10


The accepted answer gives what SQL users know as a left anti-join (a LEFT JOIN filtered with WHERE ... IS NULL). If you want all the rows except the matching ones from both DataFrames, not only from the left one, you have to change the filter so that it excludes every row that is in both.

In this case we use DataFrame.merge & DataFrame.query:

df1 = pd.DataFrame({'A':list('abcde')})
df2 = pd.DataFrame({'A':list('cdefgh')})

print(df1, '\n')
print(df2)

   A
0  a # <- only df1
1  b # <- only df1
2  c # <- both
3  d # <- both
4  e # <- both

   A
0  c # both
1  d # both
2  e # both
3  f # <- only df2
4  g # <- only df2
5  h # <- only df2

df = (
    df1.merge(df2,
              on='A',
              how='outer',
              indicator=True)
    .query('_merge != "both"')
    .drop(columns='_merge')
)

print(df)

   A
0  a
1  b
5  f
6  g
7  h
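If it matters which frame each non-matching row came from, the _merge column can be inspected before it is dropped. The following is a minimal sketch using the same df1 and df2; the names only_df1, only_df2 and non_common are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": list("abcde")})
df2 = pd.DataFrame({"A": list("cdefgh")})

# Outer merge with the indicator column kept for now.
merged = df1.merge(df2, on="A", how="outer", indicator=True)

# Split the non-matching rows by their origin.
only_df1 = merged.loc[merged["_merge"] == "left_only", "A"].tolist()
only_df2 = merged.loc[merged["_merge"] == "right_only", "A"].tolist()

# Same result as the query-based chain: everything that is not in both.
non_common = merged.query('_merge != "both"').drop(columns="_merge")

print(only_df1)
print(only_df2)
print(non_common)
```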
Erfan answered Oct 02 '22 12:10