I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:
us = u'MyString' # is the utf-8 string
Part one of my question: why does this return False?
us.encode('utf-8') == "MyString" ## False
Part two - how can I compare within a list comprehension?
myComp = [utfString for utfString in jsonLoadsObj if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.
EDIT: I'm using Google App Engine, which uses Python 2.7
Here's a more complete example of the problem:
#json coming from remote server:
#response object looks like: {"number1":"first", "number2":"second"}
data = json.loads(response)
k = data.keys()
# I need something like:
myList = [item for item in k if item=="number1"]
#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]
Python 2 supports the str type and the unicode type. A str is a sequence of bytes, while a unicode is a sequence of Unicode code points.
In unicode literals you can write code points with the escapes \x, \u, and \U. Each escape is treated as one character. You can check this with the built-in function len(), which returns the number of characters.
Unicode is a standard used to represent characters from almost all languages. Every Unicode character is assigned a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
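As a rough illustration of the point about code points and len(), here is a small sketch. The variable names are made up for this example, and it only uses characters below U+10000 so that it behaves the same on Python 2.7 and Python 3.

```python
# Each escape below denotes a single code point in a unicode literal.
e_acute = u'\xe9'      # U+00E9, written with two hex digits
em_dash = u'\u2014'    # U+2014, written with four hex digits

# len() on text counts code points, not bytes.
text = e_acute + em_dash
chars = len(text)

# The UTF-8 encoded form is a byte sequence and is usually longer.
utf8_bytes = text.encode('utf-8')
byte_count = len(utf8_bytes)

print(chars)
print(byte_count)
```

The two numbers differ because UTF-8 uses two bytes for U+00E9 and three bytes for U+2014.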
String comparison in Python can be either case sensitive (using == or !=) or case insensitive (by first normalizing both strings with lower() or upper()).
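A minimal sketch of the two kinds of comparison, using made-up sample strings. Note that lower() is only a simple approximation of case-insensitive matching and may not be adequate for every language.

```python
a = u'MyString'
b = u'mystring'

# == is case sensitive, so these differ.
case_sensitive = (a == b)

# Normalizing both sides with lower() ignores case differences.
case_insensitive = (a.lower() == b.lower())

print(case_sensitive)
print(case_insensitive)
```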
In Python 2, normal strings (str) are stored as sequences of 8-bit bytes, while unicode strings are stored as 16-bit or 32-bit code units, depending on how the interpreter was built. This allows for a more varied set of characters, including special characters from most languages in the world.
Let us see how to compare strings in Python. Method 1: Using relational operators. The relational operators compare the Unicode values of the characters of the strings from the zeroth index to the end of the string, and return a boolean value according to the operator used. For example, "Geek" == "Geek" returns True because the Unicode values of all the characters are equal.
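To make the character-by-character comparison concrete, here is a short sketch that looks at the code point values with ord(). The strings are just sample values.

```python
# Equal strings have equal code points at every index.
same = (u'Geek' == u'Geek')

# 'G' is code point 71 and 'g' is 103, so the uppercase form sorts first.
upper_first = (u'Geek' < u'geek')
code_points = [ord(ch) for ch in u'Gg']

print(same)
print(upper_first)
print(code_points)
```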
You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary. There is no need to call .keys() first:
data = json.loads(response)
myList = [item for item in data if item == "number1"]
You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:
data = json.loads(response)
myList = [item for item in data if item == u"number1"]
Both versions work fine:
>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
Note that in your first example, us is not a UTF-8 string; it is unicode data, and the json library has already decoded it for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
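As a small sketch of the difference between decoded text and UTF-8 bytes, the following uses the JSON document from the question. The variable names are chosen for illustration only, and type(u'') is used as a stand-in for "the text type" so the same code runs on Python 2.7 and Python 3.

```python
import json

data = json.loads('{"number1": "first", "number2": "second"}')
key = [k for k in data if k == u'number1'][0]

# The key is already decoded text (unicode on Python 2, str on Python 3).
is_text = isinstance(key, type(u''))

# Encoding turns text into UTF-8 bytes; decoding reverses the step.
encoded = key.encode('utf-8')
round_trip = encoded.decode('utf-8')

print(is_text)
print(round_trip == key)
```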
On Python 2, your expectation that your test returns True would be correct; you are doing something else wrong:
>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>
There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:
myComp = [elem for elem in json_data if elem == u"MyString"]
You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an "apples and oranges" comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:
>>> u'MyString' == 'MyString'  # in my opinion should be False
True
It's up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:
a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b  # True
I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte-strings can be decoded to Unicode that way.
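The asymmetry between encoding and decoding can be sketched as follows. The sample values are assumptions chosen for illustration: the byte string holds "café" encoded as Latin-1, which is not valid UTF-8.

```python
text = u'caf\xe9'

# Encoding this text to UTF-8 works.
encoded = text.encode('utf-8')

# These bytes are Latin-1, so decoding them as UTF-8 raises an error.
latin1_bytes = b'caf\xe9'
try:
    latin1_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)
```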
But if you choose to UTF-8 encode the Unicode strings before comparing, the comparison will come out False for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. If you .encode('Windows-1252') instead, it would succeed. That's why it's an apples and oranges comparison.
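The em dash case can be checked with a short sketch like the one below. It uses the codec name 'cp1252', Python's name for Windows-1252, and the b'' prefix so that it runs on both Python 2.7 and Python 3.

```python
text = u'Em dashes\u2014are cool'
cp1252_bytes = b'Em dashes\x97are cool'   # em dash is byte 0x97 in Windows-1252

# UTF-8 encodes the em dash as three bytes, so the byte strings differ.
utf8_match = (text.encode('utf-8') == cp1252_bytes)

# Encoding with the same codec that produced the bytes makes them match.
cp1252_match = (text.encode('cp1252') == cp1252_bytes)

print(utf8_match)
print(cp1252_match)
```

Whether a comparison like this matches depends on knowing which encoding produced the bytes in the first place.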