I have two tables and I would like to append them so that all of the data in table A is retained and data from table B is only added if its key is unique. (Key values are unique within table A and within table B; however, in some cases a key will occur in both table A and table B.)
I think the way to do this will involve some sort of filtering join (anti-join) to get the values in table B that do not occur in table A, then append the two tables.
I am familiar with R, and this is the code I would use to do it there:
library("dplyr") ## Filtering join to remove values already in "TableA" from "TableB" FilteredTableB <- anti_join(TableB,TableA, by = "Key") ## Append "FilteredTableB" to "TableA" CombinedTable <- bind_rows(TableA,FilteredTableB)
How would I achieve this in Python?
We can use the ~ operator to negate a semi-join condition, which gives an anti-join. An anti-join keeps all of the records in the original table except those that match the other table.
For comparison, one of the join kinds available in the Merge dialog box in Power Query is a left anti join, which brings in only the rows from the left table that don't have any matching rows in the right table.
A pandas semi-join:
1. Returns the intersection of two tables, similar to an inner join.
2. Returns only the columns from the left table, not the right.
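To make the distinction concrete, here is a minimal sketch with made-up placeholder tables (the Key column follows the question; the Value column and the data are invented):

import pandas as pd

# Placeholder tables mirroring the question's setup: unique keys per table, partial overlap
TableA = pd.DataFrame({'Key': ['a', 'b', 'c', 'd'], 'Value': [1, 2, 3, 4]})
TableB = pd.DataFrame({'Key': ['a', 'e', 'c', 'f'], 'Value': [5, 6, 7, 8]})

# Semi-join: TableB rows whose Key also appears in TableA (keeps only TableB's columns)
semi = TableB[TableB.Key.isin(TableA.Key)]

# Anti-join: negate the same mask with ~ to keep TableB rows whose Key is NOT in TableA
anti = TableB[~TableB.Key.isin(TableA.Key)]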
Passing indicator = True to the merge command will add a new column _merge that records where each row came from, with three possible values: left_only, right_only, and both. Keep the right_only and left_only rows. That is it.
# Outer merge with indicator=True adds the _merge column
outer_join = TableA.merge(TableB, how='outer', indicator=True)

# Keep only the rows that are not present in both tables, then drop the helper column
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis=1)
easy!
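If the goal is exactly the one in the question (keep every row of TableA and add only the TableB rows whose Key is new), the same indicator trick can be applied as a left merge of TableB against just TableA's key column. A sketch, reusing the placeholder tables from the snippet above:

# Tag each TableB row by whether its Key also exists in TableA
tagged = TableB.merge(TableA[['Key']], on='Key', how='left', indicator=True)

# left_only marks the TableB rows whose Key was not found in TableA (a left anti-join)
filtered_b = tagged[tagged['_merge'] == 'left_only'].drop('_merge', axis=1)

# Append the filtered rows to TableA
combined = pd.concat([TableA, filtered_b], ignore_index=True)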
Here is a comparison with a solution from piRSquared:
1) When run on this example, matching on one column, piRSquared's solution is faster.
2) But it only works for matching on one column. If you want to match on several columns, my solution works just as well as with one column (see the sketch below).
So it's up to you to decide.
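For example, matching on several columns only changes the on= argument of the merge. A minimal sketch with a hypothetical composite key (Key1, Key2) and invented data:

import pandas as pd

# Hypothetical tables whose key is the pair (Key1, Key2)
A = pd.DataFrame({'Key1': ['a', 'b', 'c'], 'Key2': [1, 2, 3], 'X': [10, 20, 30]})
B = pd.DataFrame({'Key1': ['a', 'd', 'c'], 'Key2': [1, 2, 9], 'X': [40, 50, 60]})

# Outer merge on both key columns, then keep only the rows that are not in both tables
outer = A.merge(B, how='outer', on=['Key1', 'Key2'], indicator=True)
anti = outer[outer['_merge'] != 'both'].drop('_merge', axis=1)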
Consider the following dataframes:

import numpy as np
import pandas as pd

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
(TableA has keys a, b, c, d; TableB has keys a, e, c, f; the other columns hold random values.)
This is one way (Method 1) to do what you want:

# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)

# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)
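One caveat: DataFrame.append was removed in pandas 2.0. On recent pandas the same idea can be written with pd.concat; a sketch using the TableA/TableB defined above:

# Same anti-join-then-append, written with pd.concat instead of the removed DataFrame.append
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
CombinedTable = pd.concat([TableA, TableB[where_diff]], ignore_index=True)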
# Method 2: loop over TableB and collect the rows whose Key is not in TableA
rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)

# Stitch the collected rows onto TableA
pd.concat([TableA.T] + rows, axis=1).T
Timing, 4 rows with 2 overlapping: Method 1 is much quicker.
Timing, 10,000 rows with 5,000 overlapping: loops are bad.
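To reproduce the comparison, a rough harness along these lines should work; this is only a sketch (it wraps the two methods above, with Method 1 in its pd.concat form, and absolute numbers will vary by machine and pandas version):

import timeit

def method1():
    key_diff = set(TableB.Key).difference(TableA.Key)
    where_diff = TableB.Key.isin(key_diff)
    return pd.concat([TableA, TableB[where_diff]], ignore_index=True)

def method2():
    rows = []
    for _, row in TableB.iterrows():
        if row.Key not in TableA.Key.values:
            rows.append(row)
    return pd.concat([TableA.T] + rows, axis=1).T

print('set/isin method :', timeit.timeit(method1, number=100))
print('iterrows method :', timeit.timeit(method2, number=100))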