How can I pick out the difference between two columns of the same name in two dataframes? I have dataframe A with a column named X and dataframe B with a column named X. If I do pd.merge(A, B, on=['X']), I get the common X values of A and B, but how can I get the "non-common" ones?
We can apply the '~' operator to the semi-join mask, which gives an anti-join. A semi-join is similar to an inner join: it returns the intersection, but only with the columns from the left table, not the right.
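As a minimal sketch of the idea above, the semi-join can be written as a boolean mask built with isin, and the anti-join as the negation of that mask. The frames `left` and `right` and their contents are made up for illustration; only `Series.isin` and the `~` operator come from pandas.

```python
import pandas as pd

# Hypothetical sample frames; names and values are illustrative only.
left = pd.DataFrame({"X": [1, 2, 3, 4], "label": ["a", "b", "c", "d"]})
right = pd.DataFrame({"X": [3, 4, 5], "other": ["p", "q", "r"]})

# Boolean mask: True where the key in `left` also appears in `right`.
in_right = left["X"].isin(right["X"])

# Semi-join: rows of `left` that have a match in `right`,
# keeping only the columns of `left`.
semi = left[in_right]

# Anti-join: negate the same mask with `~` to keep the unmatched rows.
anti = left[~in_right]

print(semi)
print(anti)
```

Because the mask is computed on `left` only, no columns from `right` ever enter the result, which is what separates a semi-join from an inner join.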
The concat function concatenates dataframes along rows or columns; we can think of it as stacking multiple dataframes. The merge function combines dataframes based on values in shared columns. It offers more flexibility than concat because it allows combinations based on a condition.
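The difference can be seen on two small frames. This is only a sketch; the frames `a` and `b` and their column names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample frames sharing the column X.
a = pd.DataFrame({"X": [1, 2, 3], "val_a": [10, 20, 30]})
b = pd.DataFrame({"X": [2, 3, 4], "val_b": [200, 300, 400]})

# concat stacks the frames on top of each other. Rows are not matched on X,
# so a column missing from one frame is filled with NaN for that frame's rows.
stacked = pd.concat([a, b], ignore_index=True)

# merge matches rows by the values in the shared column X
# (an inner join by default), so only X values present in both frames remain.
matched = pd.merge(a, b, on="X")

print(stacked)
print(matched)
```

Here `stacked` has all six input rows, while `matched` keeps only the rows whose X value occurs in both frames.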
In the benchmark referred to, merge was slightly faster than join. The difference per call is small, but over 4000 iterations it adds up to minutes.
Merge can be used in cases where the key columns on both the left and the right are not unique, and therefore cannot serve as a unique index. A merge is also just as efficient as a join, as long as the merging is done on indexes where possible.
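A short sketch of both points, using made-up frames: a column-based merge on keys that contain duplicates, and an index-based merge that gives the same result as DataFrame.join. All frame and column names here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frames whose key column contains duplicates on both sides.
left = pd.DataFrame({"key": ["a", "a", "b"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "b"], "r": [10, 20, 30]})

# A column-based merge accepts duplicate keys: every left row is paired
# with every right row that has the same key.
by_column = pd.merge(left, right, on="key")

# Hypothetical frames with the keys stored in the index.
left_i = pd.DataFrame({"l": [1, 2, 3]}, index=["a", "b", "c"])
right_i = pd.DataFrame({"r": [10, 20]}, index=["a", "b"])

# Merging on the indexes with how='left' ...
by_index = pd.merge(left_i, right_i, left_index=True, right_index=True, how="left")

# ... matches what join does by default (a left join on the index).
joined = left_i.join(right_i)

print(by_column)
print(by_index.equals(joined))
```

This sketch only compares results, not timings; any speed difference between the two calls would need to be measured on your own data.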
If you change the merge type to how='outer' and pass indicator=True, this will add a column that tells you whether each value is left_only, both, or right_only:
    In [1]: import numpy as np
            import pandas as pd

    In [2]: A = pd.DataFrame({'x': np.arange(5)})
            B = pd.DataFrame({'x': np.arange(3, 8)})
            print(A)
            print(B)
       x
    0  0
    1  1
    2  2
    3  3
    4  4
       x
    0  3
    1  4
    2  5
    3  6
    4  7

    In [3]: pd.merge(A, B, how='outer', indicator=True)
    Out[3]:
         x      _merge
    0  0.0   left_only
    1  1.0   left_only
    2  2.0   left_only
    3  3.0        both
    4  4.0        both
    5  5.0  right_only
    6  6.0  right_only
    7  7.0  right_only
You can then filter the resulting merged df on the _merge column:
    In [4]: merged = pd.merge(A, B, how='outer', indicator=True)
            merged[merged['_merge'] == 'left_only']
    Out[4]:
         x     _merge
    0  0.0  left_only
    1  1.0  left_only
    2  2.0  left_only
You can also use isin and negate the mask to find values not in B:
    In [5]: A[~A['x'].isin(B['x'])]
    Out[5]:
       x
    0  0
    1  1
    2  2
The accepted answer gives a so-called LEFT JOIN IF NULL in SQL terms. If you want all the rows except the matching ones from both DataFrames, not only the left one, you have to add another condition to the filter, since you want to exclude all rows that are in both.
In this case we use DataFrame.merge and DataFrame.query:
    df1 = pd.DataFrame({'A': list('abcde')})
    df2 = pd.DataFrame({'A': list('cdefgh')})
    print(df1, '\n')
    print(df2)

       A
    0  a # <- only df1
    1  b # <- only df1
    2  c # <- both
    3  d # <- both
    4  e # <- both

       A
    0  c # both
    1  d # both
    2  e # both
    3  f # <- only df2
    4  g # <- only df2
    5  h # <- only df2
    df = (
        df1.merge(df2, on='A', how='outer', indicator=True)
           .query('_merge != "both"')
           .drop(columns='_merge')
    )
    print(df)

       A
    0  a
    1  b
    5  f
    6  g
    7  h