
List of unicode strings

Tags: python, unicode

If I have a list of unicode strings

lst = [ u"aaa", u"bbb", u"foo", u"bar", ... u"baz", u"zzz" ]

is it necessary to write the u prefix before every string? Is there a construct that declares every element of lst to be a unicode string, so that I can write the literals without the u prefix?

xralf asked Feb 01 '12 14:02


2 Answers

In Python 2.7 (also Python 2.6) you can make unicode literals the default for a module:

from __future__ import unicode_literals

The import must appear at the top of the file (only comments, the module docstring, and other __future__ imports may come before it), and it then applies to all string literals in that file. Use a b prefix to force byte strings:

>>> from __future__ import unicode_literals
>>> "sss"
u'sss'
>>> b"x"
'x'
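
Applied to the question, a minimal sketch might look like the following. The list is a shortened version of the one in the question, and the names text_type, all_text and raw are only illustrative. The block is written so that it also runs on Python 3, where the import has no effect because all string literals are already Unicode.

```python
# Must be the first statement in the module (comments and a docstring may precede it).
from __future__ import unicode_literals

# No u prefix needed: every plain string literal in this file is a unicode string.
lst = ["aaa", "bbb", "foo", "bar", "baz", "zzz"]

# type("") is unicode on Python 2 (with the import above) and str on Python 3.
text_type = type("")
all_text = all(isinstance(s, text_type) for s in lst)

# A b prefix still produces a byte string.
raw = b"x"

print(all_text)
```

The import affects only the file it appears in, so other modules keep their own literal behaviour.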
Duncan answered Oct 06 '22 03:10


If your intention is to convert a list of standard (byte) strings to unicode, you could map the unicode function onto your list:

lst = ["aaa", "bbb", "ccc"]
map(unicode, lst)

Which gives

[u"aaa", u"bbb", u"ccc"]

However, if lst contains a string with non-ASCII characters, you'll have to prefix that particular string with u. If you don't, the conversion fails with this error:

lst = ["\xe4"]
map(unicode,lst)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
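
The error arises because unicode() decodes byte strings with the ASCII codec by default. One way around it, sketched below, is to decode each element with an explicit encoding. The choice of Latin-1 here is purely an assumption for illustration, since nothing above says how the bytes were encoded, and the names raw, ENCODING and decoded are hypothetical.

```python
# Byte strings; the last one contains a byte that is not valid ASCII.
raw = [b"aaa", b"bbb", b"\xe4"]

# ASSUMPTION: the data is Latin-1 encoded. Substitute the real encoding of your data.
ENCODING = "latin-1"

# Decode explicitly instead of relying on the implicit ASCII default.
decoded = [s.decode(ENCODING) for s in raw]

print(decoded == [u"aaa", u"bbb", u"\xe4"])
```

Decoding with the wrong encoding either raises an error or silently produces the wrong characters, so the encoding has to come from knowledge of the data source.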

As noted in the comments, the answer differs between Python 2.x and 3.x. In Python 3, everything changes (quoting the "What's New In Python 3.0" notes):

Everything you thought you knew about binary data and Unicode has changed. Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however encoded Unicode is represented as binary data. The type used to hold text is str, the type used to hold data is bytes. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.
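
A small Python 3 sketch of the behaviour described in that passage: text and data are separate types, mixing them raises TypeError regardless of the values involved, and converting between them requires an explicit encode or decode. The variable names and the choice of UTF-8 are illustrative assumptions.

```python
# Python 3 only: str holds text, bytes holds binary data.
text = "caf\u00e9"            # str (Unicode text)
data = text.encode("utf-8")   # bytes (encoded form of the text)

try:
    mixed = text + data       # mixing text and data
except TypeError as exc:
    error = exc               # raised even if data were pure ASCII
else:
    error = None

# The conversion has to be explicit.
round_trip = data.decode("utf-8")

print(type(error).__name__)
```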

Hooked answered Oct 06 '22 03:10