Anti-Join Pandas

I have two tables that I would like to append so that all the data in table A is retained and a row from table B is added only if its key does not already occur in table A. (Key values are unique within table A and within table B; however, in some cases a key will occur in both tables.)

I think the way to do this involves some sort of filtering join (anti-join) to get the values in table B that do not occur in table A, and then appending the two tables.

I am familiar with R and this is the code I would use to do this in R.

library("dplyr")

## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB, TableA, by = "Key")

## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA, FilteredTableB)

How would I achieve this in Python?

Ayelavan asked Jul 22 '16 01:07



2 Answers

Passing indicator=True to merge adds a new column, _merge, that records which table each row came from. It has three possible values:

  • left_only
  • right_only
  • both

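Keep the rows marked right_only and left_only and drop those marked both, which leaves the rows that appear in only one of the two tables. That is it.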

outer_join = TableA.merge(TableB, how='outer', indicator=True)

anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis=1)

easy!
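For the result asked about in the question (all of TableA plus only those rows of TableB whose Key is not in TableA), one possible sketch is to merge on the Key column alone and keep just the rows flagged as missing from TableA. The sample data and the names flags, filtered_b, and combined are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Illustrative data (assumed); Key is unique within each table
TableA = pd.DataFrame({'Key': ['a', 'b', 'c', 'd'], 'A': [1, 2, 3, 4]})
TableB = pd.DataFrame({'Key': ['a', 'e', 'c', 'f'], 'A': [10, 20, 30, 40]})

# Merge only the Key columns so that _merge reflects key membership alone.
# A left merge keeps one output row per TableB row because Key is unique in TableA.
flags = TableB[['Key']].merge(TableA[['Key']], on='Key', how='left', indicator=True)

# Rows of TableB whose Key does not occur in TableA (the anti-join)
filtered_b = TableB[(flags['_merge'] == 'left_only').to_numpy()]

# Append the filtered rows to TableA
combined = pd.concat([TableA, filtered_b], ignore_index=True)
print(combined)
```

Merging only the key columns avoids the _x/_y suffixes that would otherwise appear on the shared value columns.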

Here is a comparison with a solution from piRSquared:

1) When run on this example, which matches on a single column, piRSquared's solution is faster.

2) But it only works for matching on one column. If you want to match on several columns, my solution works just as well as it does with one column.

So it is up to you to decide.
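To illustrate the multi-column case, here is a minimal sketch in which a row is identified by a pair of columns. The frames left and right, the key columns k1 and k2, and the values are all made up for the example.

```python
import pandas as pd

# Illustrative data (assumed): each row is identified by the pair (k1, k2)
left = pd.DataFrame({'k1': [1, 1, 2, 2],
                     'k2': ['x', 'y', 'x', 'y'],
                     'val': [10, 20, 30, 40]})
right = pd.DataFrame({'k1': [1, 2, 3],
                      'k2': ['x', 'y', 'x'],
                      'val': [0, 0, 0]})

keys = ['k1', 'k2']

# Flag each row of `right` by whether its (k1, k2) pair exists in `left`
merged = right.merge(left[keys], on=keys, how='left', indicator=True)

# Keep only the rows of `right` with no match in `left`
right_only = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Append them to `left`
combined = pd.concat([left, right_only], ignore_index=True)
print(combined)
```

This assumes the key pairs are unique in left; duplicated pairs there would duplicate the matching rows of right in the merge.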

Dennis Lyubyvy answered Oct 04 '22 03:10



Consider the following dataframes

import numpy as np
import pandas as pd

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

[Image: TableA]

[Image: TableB]

This is one way to do what you want

Method 1

# Identify which Key values are in TableB but not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)

# Slice TableB accordingly and append to TableA
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
pd.concat([TableA, TableB[where_diff]], ignore_index=True)
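The same filter can also be written without building the intermediate set, by negating isin directly. This is a sketch of an equivalent shortcut rather than the answer's own code; the sample values and the names not_in_a and combined are assumptions.

```python
import pandas as pd

# Illustrative data (assumed), using the same Key layout as the example tables
TableA = pd.DataFrame({'Key': list('abcd'), 'A': [1, 2, 3, 4]})
TableB = pd.DataFrame({'Key': list('aecf'), 'A': [5, 6, 7, 8]})

# True for rows of TableB whose Key is absent from TableA
not_in_a = ~TableB['Key'].isin(TableA['Key'])

# Append the filtered rows of TableB to TableA
combined = pd.concat([TableA, TableB[not_in_a]], ignore_index=True)
print(combined)
```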


Method 2

rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)

pd.concat([TableA.T] + rows, axis=1).T

Timing

4 rows with 2 overlapping keys

Method 1 is much quicker.


10,000 rows with 5,000 overlapping keys

Loops are bad: the row-by-row loop in Method 2 is much slower at this size.
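The comparison can be re-run with a small timeit harness along these lines. This is only a sketch: make_tables is a hypothetical helper, the table size is an arbitrary choice, and the numbers printed will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_tables(n, overlap, seed=0):
    """Hypothetical helper: two n-row tables sharing `overlap` Key values."""
    rng = np.random.default_rng(seed)
    a = pd.DataFrame(rng.random((n, 3)), columns=['A', 'B', 'C'])
    a.insert(0, 'Key', np.arange(n))
    b = pd.DataFrame(rng.random((n, 3)), columns=['A', 'B', 'C'])
    b.insert(0, 'Key', np.arange(n - overlap, 2 * n - overlap))
    return a, b


def method_1(a, b):
    # Set difference plus isin; pd.concat stands in for the removed DataFrame.append
    key_diff = set(b.Key).difference(a.Key)
    return pd.concat([a, b[b.Key.isin(key_diff)]], ignore_index=True)


def method_2(a, b):
    # Row-by-row loop over b, keeping rows whose Key is not in a
    rows = [row for _, row in b.iterrows() if row.Key not in a.Key.values]
    return pd.concat([a.T] + rows, axis=1).T


a, b = make_tables(1000, 500)
for fn in (method_1, method_2):
    seconds = timeit.timeit(lambda: fn(a, b), number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per call")
```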

piRSquared answered Oct 04 '22 04:10
