
How can I compare a unicode type to a string in python?

I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:

us = u'MyString' # is the utf-8 string 

Part one of my question: why does this return False?

us.encode('utf-8') == "MyString" ## False 

Part two - how can I compare within a list comprehension?

myComp = [utfString for utfString in jsonLoadsObj
          if utfString.encode('utf-8') == "MyString"]  # wrapped to read on S.O.

EDIT: I'm using Google App Engine, which uses Python 2.7

Here's a more complete example of the problem:

# JSON coming from remote server;
# the response object looks like: {"number1":"first", "number2":"second"}
data = json.loads(response)
k = data.keys()

# I need something like:
myList = [item for item in k if item == "number1"]

# I thought this would work:
myList = [item for item in k if item.encode('utf-8') == "number1"]
rGil asked May 09 '13 21:05



2 Answers

You must be looping over the wrong data set. Just loop directly over the JSON-loaded dictionary; there is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']

Note that in your first example, us is not a UTF-8 string; it is unicode data, because the json library has already decoded it for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder
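
As a rough illustration of that distinction (a sketch, not part of the original answer; the `response` value is a stand-in for the question's payload), the snippet below checks what `json.loads` hands back and what `.encode('utf-8')` produces. It is written so that it should run on both Python 2.7 and Python 3, where the text type is called `unicode` and `str` respectively.

```python
import json

# Stand-in for the remote response described in the question.
response = '{"number1": "first", "number2": "second"}'
data = json.loads(response)

# The text type: `unicode` on Python 2, `str` on Python 3.
text_type = type(u'')

# json.loads has already decoded the keys into text (code points).
keys_are_text = all(isinstance(key, text_type) for key in data)

# Encoding turns text into a byte string; that result is the "UTF-8 string".
encoded = u'number1'.encode('utf-8')
encoded_is_bytes = isinstance(encoded, bytes)

# Comparing text with text needs no encoding step at all.
matches = [key for key in data if key == u'number1']
```

Here `keys_are_text` and `encoded_is_bytes` both come out true, and `matches` holds the single key `number1`.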

On Python 2, your expectation that your test returns True would be correct; you are doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"] 
Martijn Pieters answered Oct 06 '22 07:10


You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an "apples and oranges" comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:

>>> u'MyString' == 'MyString'  # in my opinion should be False
True

It's up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:

a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b  # True

I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte-strings can be decoded to Unicode that way.
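
The asymmetry described above can be sketched like this (an illustration only; the sample values are made up, and the byte string is assumed to come from a Latin-1 source rather than a UTF-8 one). Encoding the text succeeds, while decoding the foreign bytes as UTF-8 raises `UnicodeDecodeError`.

```python
text = u'caf\xe9'          # text: c, a, f, e-acute
latin1_bytes = b'caf\xe9'  # the same word as Latin-1 bytes (assumed source)

# Encoding text to UTF-8 works; e-acute becomes the two bytes C3 A9.
utf8_bytes = text.encode('utf-8')

# Decoding the Latin-1 bytes as UTF-8 fails: a lone E9 byte is not valid UTF-8.
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Comparing on the bytes side therefore never raises; it is simply unequal here.
equal_as_bytes = utf8_bytes == latin1_bytes
```

Comparing after `encode` gives a plain unequal result in this case, whereas comparing after `decode` would have stopped with an exception.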

If you choose to UTF-8 encode the Unicode strings before comparing, however, the comparison will come out False for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. If you .encode('Windows-1252') instead, it would succeed. That's why it's an apples and oranges comparison.
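
A small sketch of the em dash case, assuming the byte string really was produced in Windows-1252 (the variable names are illustrative, and `cp1252` is Python's codec name for Windows-1252):

```python
text = u'Em dashes\u2014are cool'
windows_bytes = b'Em dashes\x97are cool'  # em dash stored as the single byte 0x97

# UTF-8 encodes the em dash as three bytes (E2 80 94), so this does not match.
same_as_utf8 = text.encode('utf-8') == windows_bytes

# Windows-1252 encodes the em dash as 0x97, so this does match.
same_as_cp1252 = text.encode('cp1252') == windows_bytes
```

The same text therefore compares equal or unequal depending on which encoding is picked, so the encoding of the byte string has to be known before the comparison means anything.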

wberry answered Oct 06 '22 08:10