pandas - Merging on string columns not working (bug?)

Tags: python, merge, pandas, mysql

I'm trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:

    >>> df1.col1.dtype
    dtype('O')
    >>> df2.col2.dtype
    dtype('O')

I try to merge them using this:

    >>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')

The result of the inner join is empty, which at first made me think there might not be any entries in the intersection:

    >>> merge_res.shape
    (0, 19)

But when I try to match a single element, I see this really odd behavior.

    # Pick random element in second dataframe
    >>> df2.iloc[5,:].col2
    '95498208100000'

    # Manually look for it in the first dataframe
    >>> df1[df1.col1 == '95498208100000']
    0 rows × 19 columns  # Empty, which makes sense given the above merge result

    # Now look for the same value as an integer
    >>> df1[df1.col1 == 95498208100000]
    1 rows × 19 columns  # FINDS THE ELEMENT!?!

So, the columns are defined with the 'object' dtype. Searching for the value as a string doesn't yield any results. Searching for it as an integer does return a result, and I think this is the reason why the merge above doesn't work.

Any ideas what's going on?

It's almost as though pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.

(I tried to replicate this using sample dataframes, but for small examples, I don't see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)

732

asked Sep 19 '16 22:09

user1496984

2 Answers

The issue was that the object dtype is misleading. I thought it meant that all items were strings. But object only means the column holds arbitrary Python objects, and apparently, while reading the file, pandas was converting some elements to ints and leaving the rest as strings.
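
One way to check whether an object column really mixes types is to count the Python type of each element. This is a minimal sketch on a made-up stand-in for df1 (the values and the `type_counts` name are assumptions for illustration, not the original data):

```python
import pandas as pd

# Hypothetical stand-in for df1: an object column holding both ints and strs.
df1 = pd.DataFrame({"col1": pd.Series([95498208100000, "abc", 12345], dtype=object)})

# Count how many elements of each Python type the column contains.
type_counts = df1["col1"].map(lambda value: type(value).__name__).value_counts()
print(type_counts)

# The two lookups from the question: only the int comparison matches the int element.
matches_as_str = int((df1["col1"] == "95498208100000").sum())
matches_as_int = int((df1["col1"] == 95498208100000).sum())
print(matches_as_str, matches_as_int)
```

If more than one type shows up in the counts, the column is a candidate for the conversion described next.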

The solution was to make sure that every field is a string:

    >>> df1.col1 = df1.col1.astype(str)
    >>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.
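
The whole sequence can be reproduced on a small scale. The sketch below uses invented frames (the extra columns `a` and `b` and all values are assumptions) in which one side stores the key as an int and the other as a str:

```python
import pandas as pd

# Hypothetical data: df1 holds the key as an int, df2 holds it as a str.
df1 = pd.DataFrame({
    "col1": pd.Series([95498208100000, "abc"], dtype=object),
    "a": [1, 2],
})
df2 = pd.DataFrame({
    "col2": pd.Series(["95498208100000", "xyz"], dtype=object),
    "b": [10, 20],
})

# Both key columns are object dtype, yet the int and the str never compare equal.
before = pd.merge(df1, df2, left_on="col1", right_on="col2")

# Convert every key to a string on both sides, then merge again.
df1["col1"] = df1["col1"].astype(str)
df2["col2"] = df2["col2"].astype(str)
after = pd.merge(df1, df2, left_on="col1", right_on="col2")

print(before.shape, after.shape)
```

Note that `astype(str)` converts whatever is in the column, so a key that was read as a float would come out with a trailing `.0` and would still need separate handling.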

(I wish there was a way of specifying a dtype of str...)
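
On that wish: when the data comes from a CSV file rather than directly from SQL, `read_csv` accepts a `dtype` argument that keeps a column as text. This sketch assumes a CSV source and uses an in-memory string as a stand-in for the file; it may not apply to the SQL setup in the question:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "col1\n95498208100000\n00123\n"

# Default type inference parses the column as integers (leading zeros are lost).
inferred = pd.read_csv(io.StringIO(csv_text))

# Requesting str for the column keeps the values exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"col1": str})

print(inferred["col1"].tolist())
print(as_text["col1"].tolist())
```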

105

answered Oct 14 '22 14:10

user1496984

I ran into a case where the df.col = df.col.astype(str) solution did not work. Turns out the problem was in the encoding.

My original data looked like this:

    In [72]: df1['col1'][:3]
    Out[73]:
                 col1
    0  dustin pedroia
    1  kevin youkilis
    2     david ortiz

    In [72]: df2['col2'][:3]
    Out[73]:
                 col2
    0  dustin pedroia
    1  kevin youkilis
    2     david ortiz

And after using .astype(str) the merge still wasn't working so I executed the following:

    df1.col1 = df1.col1.str.encode('utf-8')
    df2.col2 = df2.col2.str.encode('utf-8')

and was able to find the difference:

    In [95]: df1
    Out[95]:
                           col1
    0  b'dustin\xc2\xa0pedroia'
    1  b'kevin\xc2\xa0youkilis'
    2     b'david\xc2\xa0ortiz'

    In [95]: df2
    Out[95]:
                    col2
    0  b'dustin pedroia'
    1  b'kevin youkilis'
    2     b'david ortiz'

At which point all I had to do was run df1.col1 = df1.col1.str.replace('\xa0',' ') on the decoded df1.col1 variable (i.e. before running .str.encode('utf-8')) and the merge worked perfectly.

NOTE: Regardless of what I was replacing I always used .str.encode('utf-8') to check whether it worked.
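
As a rough end-to-end illustration of the fix described above, the sketch below rebuilds the three names with a non-breaking space (`\xa0`) on one side only. The frames are reconstructed from the output shown, not the original data:

```python
import pandas as pd

# df1 uses a non-breaking space between the names, df2 a regular space.
df1 = pd.DataFrame({"col1": ["dustin\xa0pedroia", "kevin\xa0youkilis", "david\xa0ortiz"]})
df2 = pd.DataFrame({"col2": ["dustin pedroia", "kevin youkilis", "david ortiz"]})

# The strings look identical when printed, but no key matches.
before = pd.merge(df1, df2, left_on="col1", right_on="col2")

# Replace the non-breaking space with a regular space (literal match, no regex).
df1["col1"] = df1["col1"].str.replace("\xa0", " ", regex=False)
after = pd.merge(df1, df2, left_on="col1", right_on="col2")

print(len(before), len(after))
```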

Alternatively

Using regular expressions and the Variable Explorer in the Spyder IDE for Anaconda I found the following difference.

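    import re

    # Places the raw string into a list, so that its escaped form is displayed.
    # re.escape keeps values containing regex metacharacters from breaking the pattern.
    df1.col1 = df1.col1.apply(lambda x: re.findall(re.escape(x), x))
    df2.col2 = df2.col2.apply(lambda x: re.findall(re.escape(x), x))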

where my df1 data turned into this (copied and pasted from Spyder):

    ['dustin\xa0pedroia']
    ['kevin\xa0youkilis']
    ['david\xa0ortiz']

which just has a slightly different solution. I don't know in what case the first example wouldn't work and the second would but I wanted to provide both just in case someone runs into it :)
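
Both inspection approaches come down to viewing the escaped form of each string. A possible shortcut, sketched here with the standard library only, is the built-in `ascii()` plus `unicodedata` to name the offending character. The `hidden_chars` helper is a hypothetical name introduced for this example:

```python
import unicodedata


def hidden_chars(text):
    """Return (index, escaped char, Unicode name) for each character outside printable ASCII."""
    found = []
    for index, char in enumerate(text):
        if ord(char) < 32 or ord(char) > 126:
            found.append((index, ascii(char), unicodedata.name(char, "UNKNOWN")))
    return found


# ascii() escapes every non-ASCII character, so the difference becomes visible.
escaped = ascii("dustin\xa0pedroia")
print(escaped)

suspicious = hidden_chars("dustin\xa0pedroia")
clean = hidden_chars("dustin pedroia")
print(suspicious, clean)
```

Applied to a column, something like `df1.col1.apply(ascii)` should give the same escaped view for every row.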

answered Oct 14 '22 14:10

seeiespi
