I work in Python and would like to read user input (from the command line) in Unicode format, i.e. a Unicode equivalent of raw_input()?
Also, I would like to test Unicode strings for equality, and it looks like a standard == comparison does not work.
Use Unicode code points in strings with the \x, \u, and \U escapes. Each escape is treated as one character, which you can check with the built-in function len(), which returns the number of characters.
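For illustration, a minimal Python 2 session showing the three escape forms (each escape yields a single character in a unicode literal; printing assumes a terminal that can display the character):

>>> s = u'\xe9 \u00e9 \U000000e9'   # three ways to write LATIN SMALL LETTER E WITH ACUTE
>>> print s
é é é
>>> len(s)                          # three characters plus two spaces
5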
Unicode, on the other hand, has far more than 256 characters. That means a Unicode character may take more than one byte once encoded, so you need to distinguish between characters and bytes. Standard Python 2 strings are really byte strings, and a Python 2 character is really a byte.
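A quick Python 2 sketch of the byte/character distinction, assuming UTF-8 encoded bytes:

>>> b = '\xc3\xa9'          # a byte string: the UTF-8 encoding of 'é'
>>> len(b)                  # len() counts bytes here
2
>>> u = b.decode('utf-8')   # decode the bytes into a unicode object
>>> len(u)                  # now len() counts characters
1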
From the documentation of the unicode() built-in: if encoding and/or errors are given, unicode() will decode the object, which can be either an 8-bit string or a character buffer, using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised.
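For example, in Python 2 (the byte values assume UTF-8 input):

>>> unicode('\xc3\xa9', 'utf-8')          # decode an 8-bit string with the utf-8 codec
u'\xe9'
>>> unicode('\xc3\xa9', 'no-such-codec')  # an unknown encoding raises LookupError
Traceback (most recent call last):
  ...
LookupError: unknown encoding: no-such-codec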
To decode a string encoded in UTF-8 format, we can use the decode() method available on byte strings. This method accepts two arguments, encoding and errors. encoding gives the encoding of the string to be decoded, and errors decides how to handle errors that arise during decoding.
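A short Python 2 illustration of the errors argument (the byte values are just sample data):

>>> '\xe6\x97\xa5'.decode('utf-8')         # valid UTF-8 decodes normally
u'\u65e5'
>>> '\xff\xfe'.decode('utf-8', 'strict')   # invalid UTF-8 raises UnicodeDecodeError
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
>>> '\xff\xfe'.decode('utf-8', 'replace')  # or substitute U+FFFD for the bad bytes
u'\ufffd\ufffd'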
raw_input() returns strings as encoded by the OS or UI facilities. The difficulty is knowing which encoding that is. You might attempt the following:

import sys, locale
text = raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))

which should work correctly in most cases.
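If you want this as a small reusable helper, a sketch along the same lines might look like the following (the name unicode_raw_input is just for illustration, not a standard function):

import sys
import locale

def unicode_raw_input(prompt=u''):
    # Read a line with raw_input() and decode it using the terminal's
    # encoding, falling back to the locale's preferred encoding.
    encoding = sys.stdin.encoding or locale.getpreferredencoding(True)
    return raw_input(prompt.encode(encoding)).decode(encoding)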
We need more details about the Unicode comparisons that are not working in order to help you. However, it might be a matter of normalization. Consider the following:
>>> a1 = u'\xeatre'
>>> a2 = u'e\u0302tre'

a1 and a2 are equivalent but not equal:

>>> print a1, a2
être être
>>> print a1 == a2
False
So you might want to use the unicodedata.normalize() function:
>>> import unicodedata as ud
>>> ud.normalize('NFC', a1)
u'\xeatre'
>>> ud.normalize('NFC', a2)
u'\xeatre'
>>> ud.normalize('NFC', a1) == ud.normalize('NFC', a2)
True
If you give us more information, we might be able to help you more, though.
It should work. raw_input() returns a byte string which you must decode using the correct encoding to get your unicode object. For example, the following works for me under Python 2.5 / Terminal.app / OS X:
>>> bytes = raw_input()
日本語 Ελληνικά
>>> bytes
'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e \xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac'
>>> uni = bytes.decode('utf-8')  # substitute the encoding of your terminal if it's not utf-8
>>> uni
u'\u65e5\u672c\u8a9e \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac'
>>> print uni
日本語 Ελληνικά
As for comparing unicode strings: can you post an example where the comparison doesn't work?
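One common case, if it applies here, is comparing an undecoded byte string against a unicode object; a minimal Python 2 sketch (assuming UTF-8 bytes):

>>> '\xc3\xa9' == u'\xe9'     # UTF-8 bytes vs. the unicode character: not equal
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to unicode - interpreting them as being unequal
False
>>> '\xc3\xa9'.decode('utf-8') == u'\xe9'   # decode first, then compare
True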