I'm currently learning Python and as a Slovenian I often use UTF-8 characters to test my programs. Normally everything works fine, but there is one catch that I can't overtake. Even though I've got encoding declared on the top of the file it fails when I try to reverse a string containing special characters
#-*- coding: utf-8 -*-
a = "čšž"
print a #prints čšž
b = a[::-1]
print b #prints �šō� instead of žšč
Is there any way to fix that?
Python 2 strings are byte strings, and UTF-8 encoded text uses multiple bytes per character. Just because your terminal manages to interpret the UTF-8 bytes as characters, doesn't mean that Python knows about what bytes form one UTF-8 character.
Your bytestring consists of 6 bytes, every two bytes form one character:
>>> a = "čšž"
>>> a
'\xc4\x8d\xc5\xa1\xc5\xbe'
However, how many bytes UTF-8 uses depends on where in the Unicode standard the character is defined; ASCII characters (the first 128 characters in the Unicode standard) only need 1 byte each, and many emoji need 4 bytes!
In UTF-8 order is everything; reversing the above bytestring reverses the bytes, resulting in some gibberish as far as the UTF-8 standard is concerned, but the middle 4 bytes just happen to be valid UTF-8 sequences (for š
and ō
):
>>> a[::-1]
'\xbe\xc5\xa1\xc5\x8d\xc4'
-----~~~~~~~~^^^^^^^^####
| š ō |
\ \
invalid UTF8 byte opening UTF-8 byte missing a second byte
You'd have to decode the byte string to a unicode
object, which consists of single characters. Reversing that object gives you the right results:
b = a.decode('utf8')[::-1]
print b
You can always encode the object back to UTF-8 again:
b = a.decode('utf8')[::-1].encode('utf8')
Note that in Unicode, you can still run into issues when reversing text, when combining characters are used. Reversing text with combining characters places those combining characters in front rather than after the character they combine with, so they'll combine with the wrong character instead:
>>> print u'e\u0301a'
éa
>>> print u'e\u0301a'[::-1]
áe
You can mostly avoid this by converting the Unicode data to its normalised form (which replaces combinations with 1-codepoint forms) but there are plenty of other exotic Unicode characters that don't play well with string reversals.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With