
How to read Unicode input and compare Unicode strings in Python?

I work in Python and would like to read user input (from the command line) as Unicode, i.e. is there a Unicode equivalent of raw_input?

Also, I would like to test Unicode strings for equality, and it looks like the standard == does not work.

alexpeter asked Jan 25 '09 02:01



2 Answers

raw_input() returns strings as encoded by the OS or UI facilities. The difficulty is knowing which encoding that is. You might attempt the following:

    import sys, locale
    text = raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))

which should work correctly in most cases.
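For anyone reading this on Python 3: input() already returns Unicode text there, but the same fallback idea applies if you end up decoding raw bytes yourself. A minimal sketch, assuming Python 3; decode_input is a hypothetical helper name, not something from the standard library:

```python
import locale
import sys


def decode_input(raw, encoding=None):
    """Decode raw bytes read from the terminal into a Unicode str.

    decode_input is a hypothetical helper. The fallback order mirrors the
    snippet above: stdin's declared encoding first, then the locale's
    preferred encoding.
    """
    if encoding is None:
        # sys.stdin can be None or lack an encoding, so look it up defensively.
        encoding = getattr(sys.stdin, "encoding", None) or locale.getpreferredencoding(True)
    return raw.decode(encoding)


# Explicit encoding here, so the result does not depend on the terminal.
text = decode_input(b"\xc3\xaatre", "utf-8")
print(ascii(text))  # '\xeatre'
```

In a real program the bytes would typically come from sys.stdin.buffer.readline(), and passing no encoding lets the helper pick the terminal's.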

We need more details about the Unicode comparisons that are not working in order to help you. However, it might be a matter of normalization. Consider the following:

    >>> a1 = u'\xeatre'
    >>> a2 = u'e\u0302tre'

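a1 and a2 are canonically equivalent (a1 uses the precomposed character U+00EA, a2 uses a plain e followed by the combining circumflex U+0302) but not equal: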

    >>> print a1, a2
    être être
    >>> print a1 == a2
    False

So you might want to use the unicodedata.normalize() function:

    >>> import unicodedata as ud
    >>> ud.normalize('NFC', a1)
    u'\xeatre'
    >>> ud.normalize('NFC', a2)
    u'\xeatre'
    >>> ud.normalize('NFC', a1) == ud.normalize('NFC', a2)
    True
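If you compare strings in several places, the normalization step can be wrapped in a small helper. This is only a sketch, written for Python 3 (where every str is already Unicode); unicode_equal is a hypothetical name, and NFC as the default form is my assumption rather than a requirement:

```python
import unicodedata


def unicode_equal(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


a1 = "\xeatre"     # precomposed U+00EA
a2 = "e\u0302tre"  # 'e' followed by combining circumflex U+0302

print(a1 == a2)               # False: different code point sequences
print(unicode_equal(a1, a2))  # True: identical after NFC normalization
```

Either NFC or NFD works for equality tests, as long as both sides use the same form.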

If you give us more information, we might be able to help you more, though.

tzot answered Sep 22 '22 17:09


It should work. raw_input returns a byte string which you must decode using the correct encoding to get your unicode object. For example, the following works for me under Python 2.5 / Terminal.app / OSX:

    >>> bytes = raw_input()
    日本語 Ελληνικά
    >>> bytes
    '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e \xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac'
    >>> uni = bytes.decode('utf-8')  # substitute the encoding of your terminal if it's not utf-8
    >>> uni
    u'\u65e5\u672c\u8a9e \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac'
    >>> print uni
    日本語 Ελληνικά
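To see why the choice of encoding matters, here is a rough sketch that decodes the same byte string with the right and the wrong codec. It assumes Python 3 and uses the bytes from the session above as a literal; the variable names are just for illustration:

```python
# The UTF-8 bytes from the session above, written as a bytes literal.
raw = (b"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e "
       b"\xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac")

text = raw.decode("utf-8")     # correct codec: 12 characters
wrong = raw.decode("latin-1")  # wrong codec: never fails, one character per byte

print(len(raw), len(text), len(wrong))  # 26 12 26

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # ASCII cannot represent bytes >= 0x80, so decoding fails outright.
    print("ascii cannot decode these bytes:", exc.reason)
```

A wrong codec that fails loudly (ascii) is easier to debug than one that silently produces garbage (latin-1), which is another reason to ask the terminal for its encoding instead of guessing.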

As for comparing unicode strings: can you post an example where the comparison doesn't work?

dF. answered Sep 23 '22 17:09