
pandas - Merging on string columns not working (bug?)

I'm trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:

>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')

I try to merge them using this:

>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2') 

The result of the inner join is empty, which at first suggested to me that there might not be any entries in the intersection:

>>> merge_res.shape
(0, 19)

But when I try to match a single element, I see this really odd behavior.

# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'

# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns  # Empty, which makes sense given the above merge result

# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns  # FINDS THE ELEMENT!?!

So, the columns are defined with the 'object' dtype. Searching for the value as a string doesn't yield any results. Searching for it as an integer does return a result, and I think this is the reason why the merge above doesn't work.

Any ideas what's going on?

It's almost as though Pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.

(I tried to replicate this using sample dataframes, but for small examples, I don't see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)

user1496984 asked Sep 19 '16 22:09



2 Answers

The issue was that the object dtype is misleading. I thought it meant that all items were strings. But apparently, while reading the file, pandas was converting some elements to ints and leaving the rest as strings.
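
One way to confirm this kind of mix is to count the Python types actually stored in the column. The sketch below uses a small made-up frame (the values are illustrative, not the original data) and assumes the column holds a mix of int and str:

```python
import pandas as pd

# Hypothetical data: one column holding both int and str values.
df1 = pd.DataFrame({"col1": [95498208100000, "12345", 67890, "abc"]})

print(df1.col1.dtype)  # object, even though not every element is a str

# Count how many elements of each Python type the column holds.
type_counts = df1.col1.map(type).value_counts()
print(type_counts)

# Rows whose key is not a str are the ones a string comparison will miss.
non_str_rows = df1[~df1.col1.map(lambda v: isinstance(v, str))]
print(non_str_rows)
```

If more than one type shows up in the counts, a string-keyed merge can silently skip the non-string rows.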

The solution was to make sure that every field is a string:

>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.
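
For anyone who wants to reproduce the behavior on a small example, here is a minimal sketch. The frames and the extra columns `a` and `b` are made up, and it assumes the mismatch is an int key on one side and a str key on the other:

```python
import pandas as pd

# Hypothetical frames: df1 stores one key as an int, df2 stores it as a str.
df1 = pd.DataFrame({"col1": [95498208100000, "111"], "a": [1, 2]})
df2 = pd.DataFrame({"col2": ["95498208100000", "222"], "b": [3, 4]})

# df1.col1 is an object column, but its int key matches no str key.
before = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(before.shape)

# Cast every key to str so equal-looking values compare equal.
df1["col1"] = df1["col1"].astype(str)
df2["col2"] = df2["col2"].astype(str)

after = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(after.shape)
```

One caveat: depending on the pandas version, `astype(str)` may turn missing values into literal strings such as 'nan', so it is worth checking for nulls in the key columns first.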

(I wish there was a way of specifying a dtype of str...)
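
If the data is loaded from a file, one option is to request strings at read time. The sketch below assumes a CSV source and `pd.read_csv` with its `dtype` argument; the original data came from SQL, so this is only an illustration, and the inline CSV text is made up:

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file.
csv_text = "col1,value\n95498208100000,1\n00123,2\n"

# dtype={"col1": str} keeps the key column as text instead of letting
# pandas infer integers (which would also drop the leading zeros).
df = pd.read_csv(io.StringIO(csv_text), dtype={"col1": str})

print(df.col1.tolist())
```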

user1496984 answered Oct 14 '22 14:10


I ran into a case where the df.col = df.col.astype(str) solution did not work. Turns out the problem was in the encoding.

My original data looked like this:

In [72]: df1['col1'][:3]
Out[73]:
             col1
0  dustin pedroia
1  kevin youkilis
2     david ortiz

In [72]: df2['col2'][:3]
Out[73]:
             col2
0  dustin pedroia
1  kevin youkilis
2     david ortiz

And after using .astype(str) the merge still wasn't working so I executed the following:

df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')

and was able to find the difference:

In [95]: df1
Out[95]:
                       col1
0  b'dustin\xc2\xa0pedroia'
1  b'kevin\xc2\xa0youkilis'
2     b'david\xc2\xa0ortiz'

In [95]: df2
Out[95]:
                col2
0  b'dustin pedroia'
1  b'kevin youkilis'
2     b'david ortiz'

At which point all I had to do was run df1.col1 = df1.col1.str.replace('\xa0',' ') on the decoded df1.col1 variable (i.e. before running .str.encode('utf-8')) and the merge worked perfectly.

NOTE: Regardless of what I was replacing I always used .str.encode('utf-8') to check whether it worked.
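
The whole check-and-fix sequence can be sketched on made-up data as follows. It assumes the only difference between the two key columns is a non-breaking space (U+00A0) on one side; flagging non-ASCII characters is just one heuristic and would also flag legitimate accented names:

```python
import pandas as pd

# Made-up frames: df1 uses a non-breaking space (U+00A0), df2 a plain space.
df1 = pd.DataFrame({"col1": ["dustin\xa0pedroia", "david\xa0ortiz"]})
df2 = pd.DataFrame({"col2": ["dustin pedroia", "david ortiz"]})

# Flag keys containing any non-ASCII character; these are the suspects.
suspects = df1[df1.col1.map(lambda s: not s.isascii())]
print(len(suspects))

# Replace the non-breaking space with an ordinary one (literal, not regex).
df1["col1"] = df1["col1"].str.replace("\xa0", " ", regex=False)

merged = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(merged.shape)
```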

Alternatively

Using regular expressions and the Variable Explorer in the Spyder IDE for Anaconda I found the following difference.

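import re
# places the raw string into a list; re.escape keeps names that contain
# regex metacharacters (e.g. '.' or '(') from being read as a pattern
df1.col1 = df1.col1.apply(lambda x: re.findall(re.escape(x), x))
df2.col2 = df2.col2.apply(lambda x: re.findall(re.escape(x), x))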

where my df1 data turned into this (copied and pasted from Spyder):

['dustin\xa0pedroia']
['kevin\xa0youkilis']
['david\xa0ortiz']

which points to a slightly different solution. I don't know in which cases the first approach would fail and the second would work, but I wanted to provide both in case someone runs into it :)
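
A lighter-weight way to get the same escaped view, without regular expressions or an IDE, is Python's built-in `ascii()`. This is a sketch on made-up data, not the method used above:

```python
import pandas as pd

# Made-up column: the first name hides a non-breaking space (U+00A0).
df1 = pd.DataFrame({"col1": ["dustin\xa0pedroia", "kevin youkilis"]})

# ascii() returns a repr with non-ASCII characters escaped, so an
# invisible non-breaking space shows up as a visible \xa0 sequence.
escaped = df1.col1.map(ascii)
print(escaped.tolist())
```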

seeiespi answered Oct 14 '22 14:10