Let's say that I have two tables: people_all and people_usa, both with the same structure and therefore the same primary key.
How can I get a table of the people not in the USA? In SQL I'd do something like:
select a.*
from people_all a
left outer join people_usa u
on a.id = u.id
where u.id is null
What would be the Python equivalent? I cannot think of a way to translate this WHERE clause into pandas syntax.
The only way I can think of is to add an arbitrary field to people_usa (e.g. people_usa['dummy'] = 1), do a left join, then take only the records where 'dummy' is NaN, then delete the dummy field - which seems a bit convoluted.
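For reference, the convoluted workaround described above would look something like this (the sample data is hypothetical, since the question doesn't include any):

```python
import pandas as pd

# Hypothetical sample data
people_all = pd.DataFrame({'id': [1, 2, 3, 4],
                           'name': ['Ann', 'Bob', 'Cho', 'Dee']})
people_usa = pd.DataFrame({'id': [2, 4],
                           'name': ['Bob', 'Dee']})

# Tag every USA row with a dummy value, left join, keep rows whose tag is NaN
people_usa = people_usa.copy()
people_usa['dummy'] = 1
merged = people_all.merge(people_usa[['id', 'dummy']], on='id', how='left')
not_usa = merged[merged['dummy'].isna()].drop(columns='dummy')
print(not_usa)  # rows with id 1 and 3
```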
Thanks!
Use isin and negate the boolean mask (note that you filter people_all, since you want the people who are not in people_usa):

people_all[~people_all['ID'].isin(people_usa['ID'])]

Example:

In [364]:
people_all = pd.DataFrame({'ID': np.arange(5)})
people_usa = pd.DataFrame({'ID': [3, 4, 6, 7, 100]})
people_all[~people_all['ID'].isin(people_usa['ID'])]
Out[364]:
   ID
0   0
1   1
2   2

so 3 and 4 are removed from the result; the boolean mask looks like this:

In [366]:
people_all['ID'].isin(people_usa['ID'])
Out[366]:
0    False
1    False
2    False
3     True
4     True
Name: ID, dtype: bool

using ~ inverts the mask
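A more literal translation of the question's SQL left outer join is merge with indicator=True, which adds a _merge column marking each row as 'both', 'left_only', or 'right_only'; keeping the 'left_only' rows is exactly the SQL anti-join:

```python
import pandas as pd

people_all = pd.DataFrame({'ID': [0, 1, 2, 3, 4]})
people_usa = pd.DataFrame({'ID': [3, 4, 6, 7, 100]})

# Left join; indicator=True records each row's origin in a '_merge' column
merged = people_all.merge(people_usa, on='ID', how='left', indicator=True)

# Rows found only on the left side are the "u.id is null" rows from the SQL
not_usa = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(not_usa)  # IDs 0, 1, 2
```

This keeps all of people_all's columns in the result, just like select a.* in the original query.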
Here is another, more SQL-like pandas method, .query() (the @ prefix lets the query string refer to a variable in the surrounding scope):

people_all.query('ID not in @people_usa.ID')
or using NumPy's in1d() function on the ID columns (np.isin() is its newer equivalent):

people_all[~np.in1d(people_all['ID'], people_usa['ID'])]
NOTE: for those who have experience with SQL, it may be worth reading the pandas comparison with SQL.