I work in Python and would like to read user input (from the command line) in Unicode format, i.e. a Unicode equivalent of raw_input()?
Also, I would like to test Unicode strings for equality, and it looks like a standard == comparison does not work.
Use Unicode code points in strings with the \x, \u, and \U escapes. Each escape is treated as one character, which you can check with the built-in function len(), which returns the number of characters.
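For illustration, a minimal Python 2 session showing the three escape forms (each escape yields a single character in a unicode literal; printing assumes a terminal that can display the character):

>>> s = u'\xe9 \u00e9 \U000000e9'   # three ways to write LATIN SMALL LETTER E WITH ACUTE
>>> print s
é é é
>>> len(s)                          # three characters plus two spaces
5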
Unicode, on the other hand, has far more than 256 characters. That means a Unicode character may take more than one byte once encoded, so you need to distinguish between characters and bytes. Standard Python 2 strings are really byte strings, and a Python 2 character is really a byte.
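A quick Python 2 sketch of the byte/character distinction, assuming UTF-8 encoded bytes:

>>> b = '\xc3\xa9'          # a byte string: the UTF-8 encoding of 'é'
>>> len(b)                  # len() counts bytes here
2
>>> u = b.decode('utf-8')   # decode the bytes into a unicode object
>>> len(u)                  # now len() counts characters
1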
From the documentation of the unicode() built-in: if encoding and/or errors are given, unicode() will decode the object, which can be either an 8-bit string or a character buffer, using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised.
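For example, in Python 2 (the byte values assume UTF-8 input):

>>> unicode('\xc3\xa9', 'utf-8')          # decode an 8-bit string with the utf-8 codec
u'\xe9'
>>> unicode('\xc3\xa9', 'no-such-codec')  # an unknown encoding raises LookupError
Traceback (most recent call last):
  ...
LookupError: unknown encoding: no-such-codec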
To decode a string encoded in UTF-8 format, we can use the decode() method available on byte strings. This method accepts two arguments, encoding and errors. encoding gives the encoding of the string to be decoded, and errors decides how to handle errors that arise during decoding.
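A short Python 2 illustration of the errors argument (the byte values are just sample data):

>>> '\xe6\x97\xa5'.decode('utf-8')         # valid UTF-8 decodes normally
u'\u65e5'
>>> '\xff\xfe'.decode('utf-8', 'strict')   # invalid UTF-8 raises UnicodeDecodeError
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
>>> '\xff\xfe'.decode('utf-8', 'replace')  # or substitute U+FFFD for the bad bytes
u'\ufffd\ufffd'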
raw_input() returns strings as encoded by the OS or UI facilities. The difficulty is knowing which encoding that is. You might attempt the following:

import sys, locale
text = raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))

which should work correctly in most cases.
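If you want this as a small reusable helper, a sketch along the same lines might look like the following (the name unicode_raw_input is just for illustration, not a standard function):

import sys
import locale

def unicode_raw_input(prompt=u''):
    # Read a line with raw_input() and decode it using the terminal's
    # encoding, falling back to the locale's preferred encoding.
    encoding = sys.stdin.encoding or locale.getpreferredencoding(True)
    return raw_input(prompt.encode(encoding)).decode(encoding)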
We need more details about the Unicode comparisons that are not working in order to help you. However, it might be a matter of normalization. Consider the following:
>>> a1 = u'\xeatre'
>>> a2 = u'e\u0302tre'

a1 and a2 are equivalent but not equal:

>>> print a1, a2
être être
>>> print a1 == a2
False
So you might want to use the unicodedata.normalize() function:
>>> import unicodedata as ud
>>> ud.normalize('NFC', a1)
u'\xeatre'
>>> ud.normalize('NFC', a2)
u'\xeatre'
>>> ud.normalize('NFC', a1) == ud.normalize('NFC', a2)
True
If you give us more information, we might be able to help you more, though.
It should work. raw_input() returns a byte string which you must decode using the correct encoding to get your unicode object. For example, the following works for me under Python 2.5 / Terminal.app / OS X:
>>> bytes = raw_input()
日本語 Ελληνικά
>>> bytes
'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e \xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac'
>>> uni = bytes.decode('utf-8')  # substitute the encoding of your terminal if it's not utf-8
>>> uni
u'\u65e5\u672c\u8a9e \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac'
>>> print uni
日本語 Ελληνικά
As for comparing unicode strings: can you post an example where the comparison doesn't work?
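One common case, if it applies here, is comparing an undecoded byte string against a unicode object; a minimal Python 2 sketch (assuming UTF-8 bytes):

>>> '\xc3\xa9' == u'\xe9'     # UTF-8 bytes vs. the unicode character: not equal
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to unicode - interpreting them as being unequal
False
>>> '\xc3\xa9'.decode('utf-8') == u'\xe9'   # decode first, then compare
True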