
Pandas analogue of JOIN with WHERE clause

Tags: python, sql, pandas

I'm joining two DataFrames (A and B) in pandas.

The goal is to get only the rows of B that have no match in A (SQL analogue: A RIGHT JOIN B ON A.client_id = B.client_id WHERE A.client_id IS NULL).

In pandas, the only way I know to do this is with merge, but I don't know how to express the condition (the WHERE clause):

x = pd.merge(A, B, how='right', on='client_id')
Keithx asked Nov 29 '16 14:11



2 Answers

option 1
indicator=True

A.merge(B, on='client_id', how='right', indicator=True) \
    .query('_merge == "right_only"').drop(columns='_merge')

setup

A = pd.DataFrame(dict(client_id=[1, 2, 3], valueA=[4, 5, 6]))
B = pd.DataFrame(dict(client_id=[3, 4, 5], valueB=[7, 8, 9]))

results

(image: the filtered merge result, containing only the rows for client_id 4 and 5, with valueA missing)

more explanation
indicator=True adds a _merge column to the merge result that indicates whether each row comes from the left frame only, the right frame only, or both.

A.merge(B, on='client_id', how='outer', indicator=True)


So, I just use query to keep the rows marked right_only, then drop that column.
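The two steps described above can be sketched end to end as follows. This is only a minimal illustration using the setup frames from this answer; the variable names merged and right_only are chosen here for the example, and boolean indexing is swapped in for query.

```python
import pandas as pd

A = pd.DataFrame(dict(client_id=[1, 2, 3], valueA=[4, 5, 6]))
B = pd.DataFrame(dict(client_id=[3, 4, 5], valueB=[7, 8, 9]))

# Outer merge with the indicator column, so every row is labelled
# 'left_only', 'right_only', or 'both' in the '_merge' column.
merged = A.merge(B, on='client_id', how='outer', indicator=True)

# Keep only the rows that exist in B alone, then drop the helper column.
right_only = (
    merged[merged['_merge'] == 'right_only']
    .drop(columns='_merge')
    .reset_index(drop=True)
)
print(right_only)
```

Because the surviving rows have no match in A, valueA is missing for all of them, which is the pandas counterpart of the SQL condition A.client_id IS NULL.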


option 2
This is not really a merge. You can use query again to pull only the rows of B whose client_id values are not in A:

B.query('client_id not in @A.client_id')

or an equivalent way of saying the same thing (but faster)

B[~B.client_id.isin(A.client_id)]

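As a rough sketch, the isin approach can be wrapped in a small helper. The function name rows_only_in and its parameters are hypothetical, introduced only for this illustration; they are not part of pandas.

```python
import pandas as pd

def rows_only_in(right, left, key):
    """Return the rows of `right` whose `key` value never appears in `left`."""
    # isin builds a boolean mask of matching keys; ~ inverts it.
    return right[~right[key].isin(left[key])]

A = pd.DataFrame(dict(client_id=[1, 2, 3], valueA=[4, 5, 6]))
B = pd.DataFrame(dict(client_id=[3, 4, 5], valueB=[7, 8, 9]))

only_b = rows_only_in(B, A, 'client_id')
print(only_b)
```

Unlike the merge-based option, this keeps only B's own columns, so no empty valueA column is added and B's original index and dtypes are preserved.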

piRSquared answered Oct 17 '22 20:10


For me, this is also a bit unsatisfying, but I think the recommended way is something like:

x = pd.merge(A[A["client_id"].isnull()], B,
             how='right', on='client_id')

More information can be found in the pandas documentation.

Additionally, you might use something like A.where(A["client_id"].isnull()) for filtering. Also, note my mistake in the previous version: I was comparing to None, but you should use the isnull() function.
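To illustrate why the comparison to None does not work, here is a small sketch on a made-up Series (the values and variable names are assumptions for this example only):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Elementwise comparison with None never matches, even for the missing entry.
eq_none = s == None  # noqa: E711

# isnull() is the supported way to detect missing values.
mask = s.isnull()

print(eq_none.tolist())
print(mask.tolist())
```

The same isnull() mask is what the filter on A["client_id"] relies on.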

Quickbeam2k1 answered Oct 17 '22 20:10