Pandas Left Outer Join results in table larger than left table

Tags: python, pandas

From what I understand about a left outer join, the resulting table should never have more rows than the left table. Please let me know if this is wrong.

My left table has 192572 rows and 8 columns.

My right table has 42160 rows and 5 columns.

My left table has a field called 'id' which matches a column in my right table called 'key'.

Therefore I merge them as such:

combined = pd.merge(a, b, how='left', left_on='id', right_on='key')

But then the combined table has 236569 rows.

What am I misunderstanding?

251

asked Mar 28 '14 18:03

Terence Chow

1 Answer

You can expect the row count to increase if a key in the left DataFrame matches more than one row in the other DataFrame. Each left row is repeated once per matching right row:

In [11]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])

In [12]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

In [13]: df.merge(df2, how='left')  # merges on columns A
Out[13]:
   A  B   C
0  1  3   5
1  1  3   6
2  2  4 NaN
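To check whether this is what happened in the question's merge, a rough sketch like the one below counts how often each key appears in the right table and predicts the merged row count from that. The small frames `a` and `b` are made-up stand-ins for the real tables (only the column names 'id' and 'key' come from the question), and the arithmetic assumes there are no missing values in the key columns.

```python
import pandas as pd

# Hypothetical stand-ins for the question's tables `a` (left) and `b` (right).
a = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})
b = pd.DataFrame({"key": [1, 1, 1, 2], "y": [5, 6, 7, 8]})

# Number of rows in b for each key value.
matches_per_key = b["key"].value_counts()

# Keys that appear more than once in b are the ones that multiply left rows.
duplicated_keys = matches_per_key[matches_per_key > 1]

# In a left join each left row produces one output row per match,
# or a single row (filled with NaN) when there is no match at all.
expected_rows = int(a["id"].map(matches_per_key).fillna(1).sum())

combined = pd.merge(a, b, how="left", left_on="id", right_on="key")
print(duplicated_keys)
print(expected_rows, len(combined))
```

If `duplicated_keys` is non-empty for the real `b`, that accounts for the extra rows in `combined`.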

To avoid this behaviour, drop the duplicates in df2 before merging:

In [21]: df2.drop_duplicates(subset=['A'])  # pass keep='last' to keep the last match instead (take_last=True in old pandas versions)
Out[21]:
   A  C
0  1  5

In [22]: df.merge(df2.drop_duplicates(subset=['A']), how='left')
Out[22]:
   A  B   C
0  1  3   5
1  2  4 NaN
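If you would rather have pandas complain than silently grow the table, reasonably recent versions of `merge` accept a `validate` argument. The sketch below is only an illustration: the frames `a` and `b` are invented, and keeping the first row per key is an assumption about which duplicate you want to retain.

```python
import pandas as pd

# Hypothetical stand-ins for the question's tables.
a = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})
b = pd.DataFrame({"key": [1, 1, 2], "y": [5, 6, 8]})

# validate="many_to_one" asks merge to check that the right-hand keys are
# unique; with duplicates present it raises MergeError instead of merging.
try:
    pd.merge(a, b, how="left", left_on="id", right_on="key",
             validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# Keep one row per key (here: the first), then merge again.
b_unique = b.drop_duplicates(subset=["key"], keep="first")
combined = pd.merge(a, b_unique, how="left", left_on="id", right_on="key",
                    validate="many_to_one")
print(raised, len(a), len(combined))
```

With unique right-hand keys the left join returns exactly as many rows as the left table.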

150

answered Sep 19 '22 23:09

Andy Hayden
