
Why doesn't Python upcase special characters with upper()?

Tags: python, unicode

I don't understand this:

'ô TRAM'.upper() != 'Ô TRAM'
'ô TRAM'.upper() == 'ô TRAM'

All text editors (including vim and emacs) turn 'ô TRAM' into 'Ô TRAM' when asked to upcase it. Why does Python seem to upcase only [a-zA-Z] characters? And what is the workaround?

Olivier Pons asked Dec 04 '22 00:12


2 Answers

In Python 3, which uses Unicode strings by default, it should work.

In Python 2, you have to ask for a unicode string explicitly; this will do the trick:

u'ô TRAM'.upper()

The u prefix makes the literal a unicode object instead of a byte string, so upper() can apply Unicode case mappings to the non-ASCII characters.
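For readers on Python 3 only, here is a minimal sketch of the same contrast. It assumes that Python 3's bytes type is a fair stand-in for a Python 2 narrow string; the variable names are illustrative.

```python
text = 'ô TRAM'

# str is Unicode text in Python 3, so upper() applies Unicode case mappings.
upper_text = text.upper()            # 'Ô TRAM'

# bytes.upper() only changes the ASCII letters a-z. The two UTF-8 bytes
# that encode 'ô' are left alone, much like a Python 2 byte string.
raw = text.encode('utf-8')           # b'\xc3\xb4 TRAM'
upper_raw = raw.upper()              # b'\xc3\xb4 TRAM'

# Decoding the "upcased" bytes gives back the original text.
round_trip = upper_raw.decode('utf-8')   # 'ô TRAM'
```

The byte version comes back unchanged because none of its bytes is an ASCII lowercase letter.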

Thanakron Tandavas answered Dec 09 '22 15:12


What @Thanakron pointed out briefly is correct: you can do this on a unicode string.

You did ask why Python does not do this on "narrow" (byte) strings, though. The reason is that Unicode is a really huge thing, both in memory and in processing, and case mapping is definitely not trivial. Take a look at the Unicode standard or the implementation of the ICU library.

When Python was first released, back in the early 90s, Unicode in strings was not a big issue yet. Backwards compatibility has always been a big concern for the Python community, so it would have been very difficult to simply start doing "Unicode upcasing on narrow strings" in some 2.x version.

But people were not satisfied with this situation in the 2000s, so a new data type was introduced: unicode. If you put your data in there, you get the full-fledged Unicode features. Standard-library modules such as codecs and unicodedata are there for your convenience, too.

Oh, and by the way: a narrow string is just bytes, so it has to be interpreted in a specific codepage (encoding) before upcasing it the Unicode way makes sense. The way your string displays here is only one of many possible interpretations of those bytes (ISO-8859-1, maybe?).
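A small Python 3 sketch of why the encoding has to be known first. The same bytes are decoded under two encodings chosen only for illustration, then upcased:

```python
# The same five-plus-one bytes, as they would sit in a narrow string.
raw = 'ô TRAM'.encode('utf-8')            # b'\xc3\xb4 TRAM'

# Interpreted as UTF-8, the first two bytes are one character: 'ô'.
as_utf8 = raw.decode('utf-8')
# Interpreted as ISO-8859-1, they are two characters: 'Ã' and '´'.
as_latin1 = raw.decode('iso-8859-1')

upper_utf8 = as_utf8.upper()              # 'Ô TRAM'
upper_latin1 = as_latin1.upper()          # 'Ã´ TRAM' (nothing to upcase)
```

Upcasing is only meaningful after decoding, and the result depends on which encoding was assumed.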

But now the good part: for Python 3 the developers decided it was worth breaking backwards compatibility. The default string is now a Unicode string! Writing 'hello' in Python 3 is the same as writing u'hello' in Python 2, and on that you get Unicode functionality.

Either way, with u'blah' in Python 2 or 'blah' in Python 3, you have to make sure the Python file is saved in UTF-8 (or similar). In Python 3, UTF-8 is the default encoding of *.py files. In Python 2, you have to add a header line # -*- coding: utf-8 -*- declaring the file's encoding, or make sure your editor writes the UTF-8 BOM.
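To check which encoding the interpreter would pick for a given source file, one option in Python 3 is the standard tokenize.detect_encoding function, which looks at the BOM and the coding header. The sketch below is Python 3 only, and the helper name source_encoding is made up for this example:

```python
import io
import tokenize

def source_encoding(source_bytes):
    """Hypothetical helper: report the encoding Python 3 detects for source bytes."""
    # detect_encoding reads at most two lines via the given readline callable.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# A file with a coding header line, a file with a UTF-8 BOM, and a plain file.
with_header = b"# -*- coding: latin-1 -*-\nprint('x')\n"
with_bom = b"\xef\xbb\xbf" + b"print('x')\n"
plain = b"print('x')\n"

header_enc = source_encoding(with_header)   # 'iso-8859-1'
bom_enc = source_encoding(with_bom)         # 'utf-8-sig'
plain_enc = source_encoding(plain)          # 'utf-8' (the Python 3 default)
```

Python 2 has no equivalent default: without a header line or BOM it assumes ASCII source.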

towi answered Dec 09 '22 14:12