From what I understand about a left outer join, the resulting table should never have more rows than the left table. Please let me know if this is wrong.
My left table is 192572 rows and 8 columns.
My right table is 42160 rows and 5 columns.
My left table has a field called 'id' which matches a column in my right table called 'key'.
Therefore I merge them as such:
combined = pd.merge(a,b,how='left',left_on='id',right_on='key')
But then the combined table has 236569 rows.
What am I misunderstanding?
A left join can return more rows than the left table. For example, if there are two line items for ID 1003 in the second table, the join produces two rows for that ID. In general, if the right table has more than one row for a key you are joining on, every matching left row is repeated once per match, so the result ends up with more rows than the left table.
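One way to confirm this on your own data is to count how often each key occurs in the right table before merging. The sketch below uses small made-up frames (the `x` and `y` columns and all values are hypothetical; only the `id` and `key` column names come from the question) and predicts the merged row count from those key counts.

```python
import pandas as pd

# Hypothetical toy data: 'id' on the left, 'key' on the right, as in the question.
a = pd.DataFrame({"id": [1001, 1002, 1003], "x": ["p", "q", "r"]})
b = pd.DataFrame({"key": [1003, 1003, 1004], "y": [10, 20, 30]})

# Rows per key in the right table; any count above 1 multiplies left rows.
key_counts = b["key"].value_counts()
dup_keys = key_counts[key_counts > 1]
print(dup_keys)

combined = pd.merge(a, b, how="left", left_on="id", right_on="key")

# Each left row appears once if unmatched, otherwise once per matching right row.
matches = a["id"].map(key_counts).fillna(0).astype(int)
expected_rows = int(matches.clip(lower=1).sum())
print(len(a), len(combined), expected_rows)
```

If `dup_keys` is non-empty for your real `b`, those keys probably account for the extra rows.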
The merge is slightly faster than the join. The difference per call is small, but over 4000 iterations it adds up to minutes.
A left join, or left merge, keeps every row from the left dataframe. Rows in the left dataframe that have no corresponding join value in the right dataframe get NaN in the columns that come from the right dataframe.
An outer join, also called a full outer join, returns all rows from both pandas DataFrames. Where the join keys don't match, the corresponding cells are filled with NaN.
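The difference between the two join types can be seen on a pair of tiny frames. This is a minimal sketch with made-up data (the frames `left` and `right` are hypothetical); `indicator=True` adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

# Hypothetical frames: key A=1 exists only on the left, A=3 only on the right.
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"A": [2, 3], "C": [5, 6]})

# Left join: keeps both left rows; C is NaN where the right has no match.
left_joined = left.merge(right, on="A", how="left", indicator=True)

# Outer join: keeps rows from both sides; unmatched cells are NaN.
outer_joined = left.merge(right, on="A", how="outer", indicator=True)

print(left_joined)
print(outer_joined)
```

Here the keys are unique on both sides, so the left join returns exactly as many rows as `left`, while the outer join also returns the right-only row for `A=3`.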
You can expect the row count to increase if keys match more than one row in the other DataFrame:
In [11]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])

In [12]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

In [13]: df.merge(df2, how='left')  # merges on column A
Out[13]:
   A  B    C
0  1  3    5
1  1  3    6
2  2  4  NaN
To avoid this behaviour drop the duplicates in df2:
In [21]: df2.drop_duplicates(subset=['A'])  # pass keep='last' to keep the last duplicate instead
Out[21]:
   A  C
0  1  5

In [22]: df.merge(df2.drop_duplicates(subset=['A']), how='left')
Out[22]:
   A  B    C
0  1  3    5
1  2  4  NaN
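If you would rather have pandas fail loudly than silently add rows, `merge` also accepts a `validate` argument. The sketch below reuses `df` and `df2` from the example above and assumes a pandas version recent enough to support `validate` and `pandas.errors.MergeError`; the variable names `duplicates_detected`, `deduped` and `result` are just for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

df = pd.DataFrame([[1, 3], [2, 4]], columns=["A", "B"])
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=["A", "C"])

# validate="many_to_one" requires the merge keys to be unique in the right frame.
try:
    df.merge(df2, how="left", on="A", validate="many_to_one")
    duplicates_detected = False
except MergeError:
    duplicates_detected = True

# After dropping duplicate keys the same check passes and the row count is preserved.
deduped = df2.drop_duplicates(subset=["A"], keep="first")
result = df.merge(deduped, how="left", on="A", validate="many_to_one")
print(duplicates_detected, len(df), len(result))
```

Note that dropping duplicates discards data (here the row with `C=6`), so it is worth checking whether the duplicates are mistakes or whether you need to aggregate them first.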