I have two columns in a dataframe, title and store, containing text strings by which I want to subset the dataframe:
In [84]:
2631 coffee‑mate sugar free french ... jet.com
2633 nestle coffeemate natural bliss ... jet.com
2634 coffee‑mate liquid coffee creamer, ... jet.com
3085 coffee‑mate hazelnut ... jet.com
When I try:
df[(df.title.str.contains('coffee-mate')) & (df.store.str.contains('jet.com'))]
I get:
Out[84]:
Empty DataFrame
Columns: [title, store]
Index: []
However, when I do this:
df[(df.title.str.contains('coffee')) & (df.store.str.contains('jet.com'))]
I get:
2631 coffee‑mate sugar free french ... jet.com
2633 nestle coffeemate natural bliss ... jet.com
2634 coffee‑mate liquid coffee creamer, ... jet.com
3085 coffee‑mate hazelnut ... jet.com
I don't know what to make of this!
I tried copying the characters 'coffee-mate' to do an equality test and got False.
'coffee‑mate' == 'coffee-mate'
Out[92]: False
I have a feeling this is something to do with encoding but don't know how to detect and fix the issue. Can someone help?
The "coffee-mate" in your dataframe uses a non-breaking hyphen (u"\u2011"), while your search string uses the ordinary ASCII hyphen-minus (u"\u002d").
Non-breaking hyphen: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E2%80%91&mode=char
Your hyphen (hyphen-minus): http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=-&mode=char
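If you would rather check this from Python than from a web tool, the standard library's unicodedata module can name each character. The following is only a minimal sketch: describe_chars is a helper name I made up for illustration, and the two strings are stand-ins for a value copied from the dataframe and one typed by hand.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


# Stand-in values: the first mimics the dataframe contents,
# the second is what a keyboard normally produces.
from_dataframe = "coffee\u2011mate"
typed_by_hand = "coffee-mate"

print(describe_chars(from_dataframe))  # lists the non-breaking hyphen
print(describe_chars(typed_by_hand))   # empty list: plain ASCII only
print(unicodedata.name("-"))           # name of the keyboard hyphen
```

Any string that prints a non-empty list here contains a character that will not match its ASCII look-alike.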
While they look the same to you and me, Python considers them two different characters. If you run into this again, you can identify the character the way I did: copy and paste it into the UTF-8 tool linked above. You were wise to run a comparison of coffee-mate and coffee‑mate.
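As for fixing it, one option is to normalise the hyphen look-alikes to the ASCII hyphen before searching. This is a sketch under assumptions, not the only way to do it: normalize_hyphens and HYPHEN_LOOKALIKES are hypothetical names, the list of look-alikes is my own choice and not exhaustive, and the sample titles are shortened stand-ins for the real data.

```python
# Hypothetical mapping of hyphen look-alikes to ASCII hyphen-minus.
# Extend or trim it to suit the data; it is not an exhaustive list.
HYPHEN_LOOKALIKES = {
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
    "\u2212": "-",  # MINUS SIGN
}
_TABLE = str.maketrans(HYPHEN_LOOKALIKES)


def normalize_hyphens(text):
    """Replace hyphen look-alikes in text with the ASCII hyphen-minus."""
    return text.translate(_TABLE)


# Stand-in titles; the first and last contain U+2011.
titles = [
    "coffee\u2011mate sugar free french",
    "nestle coffeemate natural bliss",
    "coffee\u2011mate hazelnut",
]

cleaned = [normalize_hyphens(t) for t in titles]
matches = [t for t in cleaned if "coffee-mate" in t]
print(matches)
```

On the dataframe itself, the same helper could be applied with df['title'] = df['title'].map(normalize_hyphens), assuming the column has no missing values. If only the non-breaking hyphen is a problem, df.title.str.replace('\u2011', '-', regex=False) should also do the job. After either step, the original str.contains('coffee-mate') filter should find the rows.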