
Python reversing a UTF-8 string

I'm currently learning Python, and as a Slovenian I often use UTF-8 characters to test my programs. Normally everything works fine, but there is one catch that I can't overcome. Even though I've got the encoding declared at the top of the file, it fails when I try to reverse a string containing special characters:

#-*- coding: utf-8 -*-

a = "čšž"
print a    #prints čšž
b = a[::-1]
print b    #prints �šō� instead of žšč

Is there any way to fix that?

Denis Črnič asked Dec 01 '15 08:12



1 Answer

Python 2 strings are byte strings, and UTF-8 encoded text can use multiple bytes per character. Just because your terminal manages to interpret the UTF-8 bytes as characters doesn't mean that Python knows which bytes form one UTF-8 character.

Your byte string consists of 6 bytes; each pair of bytes forms one character:

>>> a = "čšž"
>>> a
'\xc4\x8d\xc5\xa1\xc5\xbe'

However, how many bytes UTF-8 uses depends on the character's code point, that is, where in the Unicode standard the character is defined; ASCII characters (the first 128 characters in the Unicode standard) only need 1 byte each, and many emoji need 4 bytes!
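
As a quick check of those byte counts, here is a minimal sketch that encodes one character from each range and measures the result. The sample characters and the names samples and byte_counts are my own picks for illustration; escapes are used so the snippet does not depend on the source file encoding.

```python
# -*- coding: utf-8 -*-
# Illustrative sample characters (chosen for this sketch, not from the question).
samples = [
    (u'a', 'ASCII letter'),
    (u'\u010d', 'c with caron, as in the question'),
    (u'\u20ac', 'euro sign'),
    (u'\U0001f600', 'grinning face emoji'),
]

# Encode each character to UTF-8 and count the resulting bytes.
byte_counts = [len(ch.encode('utf-8')) for ch, _ in samples]
# byte_counts == [1, 2, 3, 4]
```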

In UTF-8, order is everything; reversing the above byte string reverses the individual bytes, resulting in gibberish as far as the UTF-8 standard is concerned, although the middle 4 bytes just happen to be valid UTF-8 sequences (for š and ō):

>>> a[::-1]
'\xbe\xc5\xa1\xc5\x8d\xc4'
-----~~~~~~~~^^^^^^^^####
  |     š       ō      |
  \                    \
   invalid UTF-8 byte   opening UTF-8 byte missing a second byte
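
To see the same thing from Python rather than from the terminal, you could decode the reversed bytes yourself. This is only a sketch (the variable names are mine): strict decoding fails on the stray bytes, while the 'replace' error handler substitutes U+FFFD for each of them, which is roughly what a terminal does when it prints the question marks.

```python
# The six UTF-8 bytes from above, reversed byte by byte.
reversed_bytes = b'\xc4\x8d\xc5\xa1\xc5\xbe'[::-1]

# Strict decoding rejects the data because of the stray bytes at both ends.
try:
    reversed_bytes.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The 'replace' handler turns each undecodable byte into U+FFFD instead.
shown = reversed_bytes.decode('utf-8', 'replace')
# shown == u'\ufffd\u0161\u014d\ufffd'  (replacement char, š, ō, replacement char)
```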

You'd have to decode the byte string to a unicode object, which consists of single characters. Reversing that object gives you the right results:

b = a.decode('utf8')[::-1]
print b

You can always encode the object back to UTF-8 again:

b = a.decode('utf8')[::-1].encode('utf8')
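
If you need this in more than one place, the decode, reverse, encode round trip can be wrapped in a small helper. The function name reverse_utf8 is made up for this sketch, and it assumes the input really is valid UTF-8 (otherwise decode raises UnicodeDecodeError).

```python
def reverse_utf8(data):
    """Return UTF-8 bytes with the characters (code points) reversed.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    text = data.decode('utf-8')          # bytes -> unicode text
    return text[::-1].encode('utf-8')    # reverse code points, then re-encode

original = b'\xc4\x8d\xc5\xa1\xc5\xbe'   # the bytes of "čšž"
result = reverse_utf8(original)
# result == b'\xc5\xbe\xc5\xa1\xc4\x8d', the bytes of "žšč"
```

In Python 2 you could also start from a unicode literal such as u"čšž" and skip the decode step entirely.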

Note that even with unicode objects you can still run into issues when reversing text that uses combining characters. Reversing such text places each combining character in front of, rather than after, the character it combines with, so it combines with the wrong character instead:

>>> print u'e\u0301a'
éa
>>> print u'e\u0301a'[::-1]
áe

You can mostly avoid this by first converting the Unicode data to its composed normalised form, NFC (which replaces base-plus-combining sequences with single-codepoint forms where those exist), but there are plenty of other exotic Unicode characters that don't play well with string reversals.
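
A rough sketch of two ways to handle the combining-character case, using the standard unicodedata module. Both function names are invented for this example. The second function keeps combining marks attached to their base character, which is my own addition rather than a complete solution: it does not cover every kind of grapheme cluster (emoji sequences, for instance).

```python
import unicodedata

def reverse_normalised(text):
    """Compose to NFC first, then reverse code points (hypothetical helper)."""
    return unicodedata.normalize('NFC', text)[::-1]

def reverse_keeping_marks(text):
    """Reverse text while keeping combining marks after their base character.

    Hypothetical helper; only handles characters whose combining class is
    non-zero, so it is an approximation of real grapheme clusters.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch       # attach the mark to the preceding base
        else:
            clusters.append(ch)      # start a new cluster
    return u''.join(reversed(clusters))

sample = u'e\u0301a'                 # "e" + combining acute accent + "a"
# reverse_normalised(sample) == u'a\xe9'
# reverse_keeping_marks(sample) == u'ae\u0301'
```

Either way the accent stays on the e. The first variant changes the code points in the result, while the second leaves the original sequence intact.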

Martijn Pieters answered Sep 22 '22 04:09
