Anti-Join Pandas

Tags: python, merge, pandas, dataframe, anti-join

I have two tables and I would like to append them so that all the data in table A is retained and data from table B is added only if its key does not already occur in table A. (Key values are unique within table A and within table B, but in some cases a key occurs in both tables.)

I think the way to do this will involve some sort of filtering join (anti-join) to get the values in table B that do not occur in table A, then appending the two tables.

I am familiar with R, and this is the code I would use to do this in R.

    library("dplyr")

    ## Filtering join to remove values already in "TableA" from "TableB"
    FilteredTableB <- anti_join(TableB, TableA, by = "Key")

    ## Append "FilteredTableB" to "TableA"
    CombinedTable <- bind_rows(TableA, FilteredTableB)

How would I achieve this in Python?


asked Jul 22 '16 01:07

Ayelavan

2 Answers

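Passing indicator=True to merge adds a new column, _merge, that records where each row of the result came from. It takes one of three values:

left_only
right_only
both

For an anti-join, keep the rows marked left_only or right_only and drop those marked both. That is it.

    # No `on=` is given, so merge matches on every column the two tables share
    outer_join = TableA.merge(TableB, how='outer', indicator=True)

    # Drop the rows found in both tables, then remove the helper column
    anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis=1)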

easy!
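For the specific case in the question (keep all of TableA and add only the TableB rows whose Key is new), the same indicator technique can be applied as a one-sided anti-join. The following is a minimal sketch rather than part of the original answer; the sample data and the Value column are made up for illustration.

```python
import pandas as pd

# Made-up sample data; only the column name "Key" comes from the question
TableA = pd.DataFrame({"Key": list("abcd"), "Value": [1, 2, 3, 4]})
TableB = pd.DataFrame({"Key": list("aecf"), "Value": [10, 20, 30, 40]})

# Left-merge TableB against TableA's keys only, so no extra columns are pulled in
flagged = TableB.merge(TableA[["Key"]], on="Key", how="left", indicator=True)

# Rows marked left_only exist in TableB but not in TableA (the anti-join)
FilteredTableB = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")

# Append the filtered rows to TableA
CombinedTable = pd.concat([TableA, FilteredTableB], ignore_index=True)
print(CombinedTable)
```

Merging against TableA[["Key"]] rather than the whole of TableA keeps the comparison restricted to the key, so rows that share a key but differ in other columns are still treated as matches.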

Here is a comparison with the solution from piRSquared:

1) On this example, which matches on a single column, piRSquared's solution is faster.

2) However, it only works when matching on one column. If you want to match on several columns, my solution works just as well as it does with one.

So it's up to you to decide.
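To illustrate matching on several columns, here is a minimal sketch of the indicator approach with a two-column key. The helper name anti_join, the columns Key1, Key2 and Value, and the data are all hypothetical and chosen only for this example.

```python
import pandas as pd


def anti_join(left, right, on):
    """Return the rows of `left` whose `on` columns have no match in `right`."""
    # Keep only the key columns of `right`, de-duplicated, so that the merge
    # neither adds columns nor multiplies rows of `left`
    right_keys = right[on].drop_duplicates()
    flagged = left.merge(right_keys, on=on, how="left", indicator=True)
    return flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")


# Hypothetical tables keyed on the pair (Key1, Key2)
TableA = pd.DataFrame({"Key1": ["a", "c"], "Key2": [1, 3], "Value": [0.1, 0.2]})
TableB = pd.DataFrame({"Key1": ["a", "a", "b"], "Key2": [1, 2, 1], "Value": [0.3, 0.4, 0.5]})

# Only ("a", 1) occurs in both tables, so the other two rows of TableB survive
FilteredTableB = anti_join(TableB, TableA, on=["Key1", "Key2"])
CombinedTable = pd.concat([TableA, FilteredTableB], ignore_index=True)
print(CombinedTable)
```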



answered Oct 04 '22 03:10

Dennis Lyubyvy

Consider the following dataframes:

    import numpy as np
    import pandas as pd

    TableA = pd.DataFrame(np.random.rand(4, 3),
                          pd.Index(list('abcd'), name='Key'),
                          ['A', 'B', 'C']).reset_index()
    TableB = pd.DataFrame(np.random.rand(4, 3),
                          pd.Index(list('aecf'), name='Key'),
                          ['A', 'B', 'C']).reset_index()


This is one way to do what you want

Method 1

    # Identify the key values that are in TableB but not in TableA
    key_diff = set(TableB.Key).difference(TableA.Key)
    where_diff = TableB.Key.isin(key_diff)

    # Slice TableB accordingly and append to TableA
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    pd.concat([TableA, TableB[where_diff]], ignore_index=True)
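Method 1 can arguably be shortened, because isin accepts the other table's key column directly and the explicit set difference is not needed. A minimal sketch, assuming Key is an ordinary column in both tables and using made-up data in place of the random values:

```python
import pandas as pd

# Made-up data with the same key layout as the tables in this answer
TableA = pd.DataFrame({"Key": list("abcd"), "A": [1.0, 2.0, 3.0, 4.0]})
TableB = pd.DataFrame({"Key": list("aecf"), "A": [5.0, 6.0, 7.0, 8.0]})

# Boolean mask: True where TableB's key does not occur anywhere in TableA
not_in_a = ~TableB["Key"].isin(TableA["Key"])

# Append only the unmatched TableB rows to TableA
CombinedTable = pd.concat([TableA, TableB[not_in_a]], ignore_index=True)
print(CombinedTable)
```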


Method 2

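    # Collect the rows of TableB whose Key does not occur in TableA
    rows = []
    for i, row in TableB.iterrows():
        if row.Key not in TableA.Key.values:
            rows.append(row)

    # Add the collected rows as columns of the transposed TableA, then transpose back
    pd.concat([TableA.T] + rows, axis=1).T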

Timing

4 rows, 2 overlapping keys: Method 1 is much quicker.

10,000 rows, 5,000 overlapping keys: loops are bad. The explicit loop in Method 2 scales poorly.
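The comparison can be reproduced with a small timing harness along these lines. This is only a sketch: the function names, the make_tables helper, and the reduced table size (1,000 rows with 500 overlapping keys, to keep the run short) are assumptions, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_tables(n_rows, n_overlap, seed=0):
    """Build two tables of n_rows each that share n_overlap key values."""
    rng = np.random.default_rng(seed)
    keys_a = [f"k{i}" for i in range(n_rows)]
    keys_b = [f"k{i}" for i in range(n_rows - n_overlap, 2 * n_rows - n_overlap)]
    table_a = pd.DataFrame(rng.random((n_rows, 3)), columns=["A", "B", "C"])
    table_a.insert(0, "Key", keys_a)
    table_b = pd.DataFrame(rng.random((n_rows, 3)), columns=["A", "B", "C"])
    table_b.insert(0, "Key", keys_b)
    return table_a, table_b


def method_1(table_a, table_b):
    # Vectorised: set difference on the keys, then a boolean mask
    key_diff = set(table_b.Key).difference(table_a.Key)
    return pd.concat([table_a, table_b[table_b.Key.isin(key_diff)]], ignore_index=True)


def method_2(table_a, table_b):
    # Explicit Python loop over the rows of table_b
    rows = []
    for _, row in table_b.iterrows():
        if row.Key not in table_a.Key.values:
            rows.append(row)
    return pd.concat([table_a.T] + rows, axis=1).T


table_a, table_b = make_tables(1_000, 500)
for method in (method_1, method_2):
    seconds = timeit.timeit(lambda: method(table_a, table_b), number=3) / 3
    print(f"{method.__name__}: {seconds:.4f} s per call")
```

Raising the arguments of make_tables to 10,000 and 5,000 matches the larger case described above, at the cost of a noticeably longer run for Method 2.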


answered Oct 04 '22 04:10

piRSquared
