Exact same text strings not matching

Question

I have two columns in a dataframe, title and store, containing text strings by which I want to subset the dataframe:

In [84]:
    2631              coffee‑mate sugar free french ...  jet.com
    2633            nestle coffeemate natural bliss ...  jet.com
    2634         coffee‑mate liquid coffee creamer, ...  jet.com
    3085                       coffee‑mate hazelnut ...  jet.com

When I try:

df[(df.title.str.contains('coffee-mate')) & (df.store.str.contains('jet.com'))]

I get:

Out[84]: 
Empty DataFrame
Columns: [title, store]
Index: []

However, when I do this:

df[(df.title.str.contains('coffee')) & (df.store.str.contains('jet.com'))]

I get:

    2631              coffee‑mate sugar free french ...  jet.com
    2633            nestle coffeemate natural bliss ...  jet.com
    2634         coffee‑mate liquid coffee creamer, ...  jet.com
    3085                       coffee‑mate hazelnut ...  jet.com

I don't know what to make of this!

I tried copying the characters 'coffee-mate' from the dataframe output to run an equality test against the string I typed, and got False.

'coffee‑mate' == 'coffee-mate'
Out[92]: False

I have a feeling this has something to do with encoding, but I don't know how to detect and fix the issue. Can someone help?

Parker · Accepted Answer

The "coffee-mate" in your dataframe uses a non-breaking hyphen (u"\u2011"), while your search string uses the ordinary keyboard hyphen, the hyphen-minus (u"\u002d").

Non-breaking hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E2%80%91&mode=char

Your hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=-&mode=char

While they look the same to you and me, Python treats them as two different characters, so the substring match fails. If you run into this again, you can diagnose it the way I did: copy and paste the suspect character into the UTF-8 tool linked above. You were wise to run a direct comparison of coffee-mate and coffee‑mate.
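If you would rather check this from Python than from a web tool, here is a minimal sketch using only the standard library. The helper names (describe_non_ascii, normalize_hyphens) and the small look-alike table are my own illustration, not part of pandas or of your code, and the table is certainly not an exhaustive list of hyphen-like characters.

```python
import unicodedata

# Map hyphen look-alikes to the ASCII hyphen-minus (U+002D).
# Illustrative assumption: only these two are handled here.
HYPHEN_LOOKALIKES = {
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
}
_TABLE = str.maketrans(HYPHEN_LOOKALIKES)


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


def normalize_hyphens(text):
    """Replace the hyphen look-alikes listed above with a plain hyphen-minus."""
    return text.translate(_TABLE)


from_dataframe = "coffee\u2011mate"  # as stored in the title column
typed = "coffee-mate"                # as typed on the keyboard

print(describe_non_ascii(from_dataframe))           # shows the hidden character
print(from_dataframe == typed)                      # the failing comparison
print(normalize_hyphens(from_dataframe) == typed)   # comparison after cleanup
```

To apply the same cleanup to the dataframe before filtering, one option (assuming a reasonably recent pandas) is df.title.str.replace('\u2011', '-', regex=False), after which str.contains('coffee-mate') should find the rows.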

Tags: python, string, pandas, dataframe, character-encoding

vagabond

1 Answers

Parker
