
Pandas Left Outer Join results in table larger than left table

Tags: python, pandas

From what I understand about a left outer join, the resulting table should never have more rows than the left table. Please let me know if this is wrong.

My left table is 192572 rows and 8 columns.

My right table is 42160 rows and 5 columns.

My left table has a column called 'id' which matches a column in my right table called 'key'.

Therefore I merge them as such:

combined = pd.merge(a,b,how='left',left_on='id',right_on='key') 

But the combined table then has 236569 rows.

What am I misunderstanding?

Terence Chow asked Mar 28 '14 18:03




1 Answer

You can expect the row count to increase if a key in the left DataFrame matches more than one row in the right DataFrame:

In [11]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])

In [12]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

In [13]: df.merge(df2, how='left')  # merges on columns A
Out[13]:
   A  B   C
0  1  3   5
1  1  3   6
2  2  4 NaN
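One way to see where the extra rows come from is to count, for each left row, how many right rows share its key: a left merge emits one output row per match, and a single NaN-filled row when there is no match. Below is a minimal sketch on the same toy frames. The helper name `expected_left_merge_rows` is hypothetical (it is not a pandas function), and it assumes a single key column with no NaN keys.

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])


def expected_left_merge_rows(left, right, left_on, right_on):
    """Predict the row count of a left merge on a single key column.

    Hypothetical helper for illustration; assumes the key columns hold no NaN.
    """
    # Number of rows in `right` for each distinct key value.
    matches_per_key = right[right_on].value_counts()
    # Look up that count for every left row; keys absent from `right` get 0.
    matches = left[left_on].map(matches_per_key).fillna(0).astype(int)
    # A left row with no match still appears once in the result.
    return int(matches.clip(lower=1).sum())


predicted = expected_left_merge_rows(df, df2, 'A', 'A')
actual = len(df.merge(df2, how='left', on='A'))
print(predicted, actual)
```

Applied to the question's frames, the same call would be `expected_left_merge_rows(a, b, 'id', 'key')`; a result larger than `len(a)` indicates repeated values in `b['key']`.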

To avoid this behaviour, drop the duplicates in df2 before merging, so that only one row per key is kept:

In [21]: df2.drop_duplicates(subset=['A'])  # pass keep='last' to keep the last duplicate instead (take_last=True in old pandas versions)
Out[21]:
   A  C
0  1  5

In [22]: df.merge(df2.drop_duplicates(subset=['A']), how='left')
Out[22]:
   A  B   C
0  1  3   5
1  2  4 NaN
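Dropping duplicates silently discards rows, so it can help to look at the duplicated keys first and to have the merge itself check uniqueness. The sketch below assumes a pandas version in which `merge` accepts the `validate` argument; the variable names `dupes`, `merge_ok`, `deduped` and `safe` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

# Rows of df2 whose key occurs more than once (keep=False marks every copy).
dupes = df2[df2['A'].duplicated(keep=False)]

# validate='many_to_one' asks pandas to check that the merge keys are unique
# in the right frame, and to raise MergeError if they are not.
try:
    df.merge(df2, how='left', on='A', validate='many_to_one')
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

# After deduplicating the right frame, the same check passes and the
# result has exactly as many rows as the left frame.
deduped = df2.drop_duplicates(subset=['A'])
safe = df.merge(deduped, how='left', on='A', validate='many_to_one')
```

Whether to keep the first duplicate, the last, or to aggregate the duplicated rows depends on what the right table represents; the check only makes the duplication visible.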
Andy Hayden answered Sep 19 '22 23:09