
UTF-8 and upper()

I want to transform UTF-8 strings using built-in functions such as upper() and capitalize().

For example:

>>> mystring = "işğüı"
>>> print mystring.upper()
Işğüı  # should be İŞĞÜI instead.

How can I fix this?

Hellnar asked Feb 23 '10 00:02



2 Answers

Do not perform actions on encoded strings; decode to unicode first.

>>> mystring = "işğüı"
>>> print mystring.decode('utf-8').upper()
IŞĞÜI
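
To see why the encoded form misbehaves, here is a minimal sketch. It assumes Python 3, where bytes plays the role of the encoded str used in this thread; the variable names are illustrative only. Calling upper() on the bytes only touches ASCII letters, so the multi-byte UTF-8 sequences pass through unchanged.

```python
text = "işğüı"                   # Unicode text
encoded = text.encode("utf-8")   # the encoded (byte) form

# On bytes, upper() only maps the ASCII range a-z, so just the leading "i" changes.
upper_bytes = encoded.upper()

# On Unicode text, upper() applies the Unicode case mappings to every character.
upper_text = text.upper()

print(upper_bytes.decode("utf-8"))  # Işğüı
print(upper_text)                   # IŞĞÜI
```
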
Ignacio Vazquez-Abrams answered Nov 04 '22 07:11



It's actually best, as a general strategy, to always keep your text as Unicode once it's in memory: decode it at the moment it's input, and encode it exactly at the moment you need to output it, if there are specific encoding requirements at input and/or output times.
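
A rough sketch of that strategy at the file boundary, assuming Python 3 and UTF-8 on both sides. The function upper_file and the temporary file names are hypothetical, chosen only for illustration: io.open decodes while reading and encodes while writing, so everything in between is plain Unicode text.

```python
import io
import os
import tempfile

def upper_file(src_path, dst_path, encoding="utf-8"):
    """Hypothetical helper: decode on input, process as Unicode, encode on output."""
    # Decode at the moment of input.
    with io.open(src_path, "r", encoding=encoding) as src:
        text = src.read()
    # Process the text while it is Unicode.
    result = text.upper()
    # Encode at the moment of output.
    with io.open(dst_path, "w", encoding=encoding) as dst:
        dst.write(result)
    return result

# Demo with throwaway files (paths are assumptions for this example).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.txt")
dst_path = os.path.join(workdir, "out.txt")
with io.open(src_path, "wb") as f:
    f.write("işğüı".encode("utf-8"))

result = upper_file(src_path, dst_path)
with io.open(dst_path, "rb") as f:
    written = f.read()
```
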

Even if you don't choose to adopt this general strategy (and you should!), the only sound way to perform the task you require is still to decode, process, encode again -- never to work on the encoded forms. I.e.:

mystring = "işğüı"
print mystring.decode('utf-8').upper().encode('utf-8')

assuming you're constrained to encoded strings at assignment and for output purposes. (The output constraint is unfortunately realistic, the assignment constraint isn't -- just do mystring = u"işğüı", making it unicode from the start, and save yourself at least the .decode call!-)
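
Note that decoding first yields IŞĞÜI with a plain I, not the İŞĞÜI the question asks for, because upper() applies the default Unicode case mapping rather than Turkish rules. If the dotted capital İ matters, one possible workaround is to map the two Turkish i letters by hand before calling upper(). This is only a sketch, assuming Python 3; turkish_upper is a hypothetical helper, not a built-in.

```python
def turkish_upper(text):
    """Hypothetical helper: uppercase with Turkish dotted/dotless i rules."""
    # Map dotted i to dotted capital İ and dotless ı to plain I first,
    # then let upper() handle every other character.
    return text.replace("i", "İ").replace("ı", "I").upper()

default_result = "işğüı".upper()          # default Unicode mapping
turkish_result = turkish_upper("işğüı")   # Turkish-aware mapping

print(default_result)   # IŞĞÜI
print(turkish_result)   # İŞĞÜI
```
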

Alex Martelli answered Nov 04 '22 09:11
