
Exact same text strings not matching

I have two columns in a dataframe, title and store, containing text strings by which I want to subset the dataframe:

In [84]:
    2631              coffee‑mate sugar free french ...  jet.com
    2633            nestle coffeemate natural bliss ...  jet.com
    2634         coffee‑mate liquid coffee creamer, ...  jet.com
    3085                       coffee‑mate hazelnut ...  jet.com

When I try:

df[(df.title.str.contains('coffee-mate')) & (df.store.str.contains('jet.com'))]

I get:

Out[84]: 
Empty DataFrame
Columns: [title, store]
Index: []

However, when I do this:

df[(df.title.str.contains('coffee')) & (df.store.str.contains('jet.com'))]

I get:

    2631              coffee‑mate sugar free french ...  jet.com
    2633            nestle coffeemate natural bliss ...  jet.com
    2634         coffee‑mate liquid coffee creamer, ...  jet.com
    3085                       coffee‑mate hazelnut ...  jet.com

I don't know what to make of this!

I tried copying the characters 'coffee-mate' to do an equivalency test and got False.

'coffee‑mate' == 'coffee-mate'
Out[92]: False

I have a feeling this is something to do with encoding but don't know how to detect and fix the issue. Can someone help?

vagabond asked Jul 01 '17 20:07


1 Answer

The "coffee-mate" in your dataframe uses a non-breaking hyphen (u"\u2011"), while your search string uses the ordinary keyboard hyphen, the hyphen-minus (u"\u002d").

Non-breaking hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E2%80%91&mode=char

Your hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=-&mode=char
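If you would rather check this from Python than from a web page, the standard-library unicodedata module can name each character. This is only a sketch: describe_chars is a helper name I made up for illustration, and the two strings are retyped from the question rather than read from your dataframe.

```python
import unicodedata

def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]

# Assumption: these mirror the two strings from the question.
from_dataframe = "coffee\u2011mate"  # the title as stored in the dataframe
typed = "coffee-mate"                # the search string typed on a keyboard

# Print only the positions where the two strings differ.
for stored, searched in zip(describe_chars(from_dataframe), describe_chars(typed)):
    if stored != searched:
        print(stored, "!=", searched)
```

The only pair it prints is the hyphen position, with U+2011 NON-BREAKING HYPHEN on one side and U+002D HYPHEN-MINUS on the other.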

While they look the same to you and me, Python treats them as two different characters, so the match fails. If you run into this again: I found the difference just by copy-pasting each character into the UTF-8 tool linked above. You were wise to run a comparison of 'coffee-mate' and 'coffee‑mate'.
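As for fixing it, one option is to map the look-alike hyphens to the plain keyboard hyphen before searching. The sketch below is not the only way to do it: normalize_hyphens and HYPHEN_LOOKALIKES are names I invented, the sample titles are made up to resemble yours, and the table covers only U+2010 and U+2011 (extend it if your data has other dash variants). Note that unicodedata.normalize("NFKC", ...) on its own would turn U+2011 into U+2010 HYPHEN, not into "-", which is why an explicit table is used here.

```python
# Assumption: only these two look-alikes need replacing; add more if needed.
HYPHEN_LOOKALIKES = str.maketrans({
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
})

def normalize_hyphens(text):
    """Replace Unicode hyphen look-alikes with the ASCII hyphen-minus."""
    return text.translate(HYPHEN_LOOKALIKES)

# Hypothetical sample data standing in for the title column.
titles = [
    "coffee\u2011mate sugar free",
    "nestle coffeemate natural bliss",
]

cleaned = [normalize_hyphens(t) for t in titles]

# A plain substring search with the keyboard hyphen now matches.
matches = [t for t in cleaned if "coffee-mate" in t]
print(matches)
```

With pandas, something like df['title'] = df['title'].map(normalize_hyphens) should apply the same cleanup to the whole column, after which your original str.contains('coffee-mate') filter has a chance to match.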

Parker answered Nov 09 '22 04:11