Let's say that I have two tables: people_all and people_usa, both with the same structure and therefore the same primary key.
How can I get a table of the people not in the USA? In SQL I'd do something like:
select a.*
from people_all a
left outer join people_usa u
on a.id = u.id
where u.id is null
What would be the Python equivalent? I cannot think of a way to translate this WHERE clause into pandas syntax.
The only way I can think of is to add an arbitrary field to people_usa (e.g. people_usa['dummy'] = 1), do a left join, then take only the records where 'dummy' is NaN, then delete the dummy field - which seems a bit convoluted.
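For reference, the convoluted workaround described above would look something like this (the sample data is hypothetical, since the question doesn't include any):

```python
import pandas as pd

# Hypothetical sample data
people_all = pd.DataFrame({'id': [1, 2, 3, 4],
                           'name': ['Ann', 'Bob', 'Cho', 'Dee']})
people_usa = pd.DataFrame({'id': [2, 4],
                           'name': ['Bob', 'Dee']})

# Tag every USA row with a dummy value, left join, keep rows whose tag is NaN
people_usa = people_usa.copy()
people_usa['dummy'] = 1
merged = people_all.merge(people_usa[['id', 'dummy']], on='id', how='left')
not_usa = merged[merged['dummy'].isna()].drop(columns='dummy')
print(not_usa)  # rows with id 1 and 3
```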
Thanks!
Use isin and negate the boolean mask (note that you filter people_all, since you want the people who are not in people_usa):

people_all[~people_all['ID'].isin(people_usa['ID'])]

Example:

In [364]:
people_all = pd.DataFrame({'ID': np.arange(5)})
people_usa = pd.DataFrame({'ID': [3, 4, 6, 7, 100]})
people_all[~people_all['ID'].isin(people_usa['ID'])]
Out[364]:
   ID
0   0
1   1
2   2

so 3 and 4 are removed from the result; the boolean mask looks like this:

In [366]:
people_all['ID'].isin(people_usa['ID'])
Out[366]:
0    False
1    False
2    False
3     True
4     True
Name: ID, dtype: bool

using ~ inverts the mask
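A more literal translation of the question's SQL left outer join is merge with indicator=True, which adds a _merge column marking each row as 'both', 'left_only', or 'right_only'; keeping the 'left_only' rows is exactly the SQL anti-join:

```python
import pandas as pd

people_all = pd.DataFrame({'ID': [0, 1, 2, 3, 4]})
people_usa = pd.DataFrame({'ID': [3, 4, 6, 7, 100]})

# Left join; indicator=True records each row's origin in a '_merge' column
merged = people_all.merge(people_usa, on='ID', how='left', indicator=True)

# Rows found only on the left side are the "u.id is null" rows from the SQL
not_usa = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(not_usa)  # IDs 0, 1, 2
```

This keeps all of people_all's columns in the result, just like select a.* in the original query.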
Here is another, more SQL-like pandas method, .query() (the @ prefix lets the query string refer to a variable in the surrounding scope):

people_all.query('ID not in @people_usa.ID')
or using NumPy's in1d() function on the ID columns (np.isin() is its newer equivalent):

people_all[~np.in1d(people_all['ID'], people_usa['ID'])]
NOTE: for those who have experience with SQL, it may be worth reading the pandas comparison with SQL.