
Converting octet strings to Unicode strings, Python 3

I'm trying to convert a string with octal-escaped Unicode back into a proper Unicode string as follows, using Python 3:

"training\345\256\214\346\210\220\345\276\214.txt" is the read-in string.

"training完成後.txt" is the string's actual representation, which I'm trying to obtain.

However, after skimming SO, the solution suggested almost everywhere I looked for Python 3 seems to be the following:

decoded_string = bytes(myString, "utf-8").decode("unicode_escape")

Unfortunately, that seems to yield the wrong Unicode string when applied to my sample:

'trainingå®Â\x8cæÂ\x88Â\x90å¾Â\x8c.txt'

This seems easy to do with byte literals, as well as in Python 2, but unfortunately doesn't seem as easy with strings in Python 3. Help much appreciated, thanks! :)

asked Feb 02 '26 22:02 by coltonoscopy


1 Answer

Assuming your starting string is a Unicode string with literal backslashes, you first need a byte string to use the unicode-escape codec. The octal escapes are UTF-8 bytes, though, so after unescaping you need to convert the result back to a byte string and then decode that as UTF-8:

>>> s = r'training\345\256\214\346\210\220\345\276\214.txt'
>>> s
'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1')
b'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1').decode('unicode-escape')
'trainingå®\x8cæ\x88\x90å¾\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'training\xe5\xae\x8c\xe6\x88\x90\xe5\xbe\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'training完成後.txt'
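
The same four steps can be wrapped in a small helper. This is only a sketch: the name decode_octal_utf8 is made up for illustration, and it assumes the input contains nothing above U+00FF (otherwise the first encode raises UnicodeEncodeError) and that every backslash sequence in it is meant to be interpreted as an escape.

```python
def decode_octal_utf8(text: str) -> str:
    """Interpret literal octal escapes in text as UTF-8 bytes (hypothetical helper)."""
    # Step 1: str -> bytes; latin1 maps U+0000..U+00FF to bytes 00..FF unchanged.
    raw = text.encode("latin1")
    # Step 2: process the backslash escapes; each octal escape becomes one code point.
    unescaped = raw.decode("unicode-escape")
    # Step 3: back to bytes, again one code point per byte.
    utf8_bytes = unescaped.encode("latin1")
    # Step 4: those bytes are UTF-8, so decode them as such.
    return utf8_bytes.decode("utf8")


name = decode_octal_utf8(r"training\345\256\214\346\210\220\345\276\214.txt")
print(name)
```

Keeping each step on its own line makes it easier to see which stage fails if the input turns out not to be UTF-8 after all.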

Note that the latin1 codec does a direct translation of Unicode codepoints U+0000 to U+00FF to bytes 00-FF.
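
A quick way to convince yourself of that one-to-one mapping is to round-trip every possible byte value. The variable names here are just for illustration:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as latin1 yields code points U+0000..U+00FF in the same order.
as_text = all_bytes.decode("latin1")
assert [ord(ch) for ch in as_text] == list(range(256))

# Encoding back reproduces the original bytes exactly.
assert as_text.encode("latin1") == all_bytes
```

That lossless round trip is why latin1 works as the bridge between str and bytes in the steps above.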

answered Feb 05 '26 13:02 by Mark Tolonen


