I'm trying to convert a string with octal-escaped Unicode back into a proper Unicode string as follows, using Python 3:
"training\345\256\214\346\210\220\345\276\214.txt" is the read-in string.
"training完成後.txt" is the string's actual representation, which I'm trying to obtain.
However, after skimming SO, the solution suggested almost everywhere for Python 3 seems to be the following:
decoded_string = bytes(myString, "utf-8").decode("unicode_escape")
Unfortunately, that seems to yield the wrong Unicode string when applied to my sample:
'trainingå®Â\x8cæÂ\x88Â\x90å¾Â\x8c.txt'
This seems easy to do with byte literals, as well as in Python 2, but unfortunately doesn't seem as easy with strings in Python 3. Help much appreciated, thanks! :)
Assuming your starting string is a Unicode string containing literal backslashes, the unicode-escape codec needs a byte string as input, so encode it first. Decoding then turns each octal escape into a code point in the range U+0000 to U+00FF, but those values are really UTF-8 bytes, so you need to encode back to a byte string with latin1 and then decode as UTF-8:
>>> s = r'training\345\256\214\346\210\220\345\276\214.txt'
>>> s
'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1')
b'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1').decode('unicode-escape')
'trainingå®\x8cæ\x88\x90å¾\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'training\xe5\xae\x8c\xe6\x88\x90\xe5\xbe\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'training完成後.txt'
Note that the latin1 codec does a direct translation of Unicode codepoints U+0000 to U+00FF to bytes 00-FF.
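Because latin1 only covers U+0000 to U+00FF, the approach fails with a UnicodeEncodeError if the string already contains characters outside that range. One possible workaround, sketched here with the re module and a made-up function name (decode_octal_runs), is to replace only the runs of octal escapes and leave everything else alone. It assumes every escape is exactly three octal digits and that each run forms valid UTF-8.

```python
import re

# A run of one or more \ooo escapes; the first digit is limited to 0-3
# so that every value fits in a single byte (0-255).
_OCTAL_RUN = re.compile(r'(?:\\[0-3][0-7]{2})+')


def decode_octal_runs(text: str) -> str:
    """Replace each run of literal octal escapes with the UTF-8 text it encodes."""
    def _fix(match: re.Match) -> str:
        # '\345\256\214' -> ['', '345', '256', '214'] -> bytes -> str
        codes = match.group(0).split('\\')[1:]
        return bytes(int(code, 8) for code in codes).decode('utf8')

    return _OCTAL_RUN.sub(_fix, text)


print(decode_octal_runs(r'training\345\256\214\346\210\220\345\276\214.txt'))
# training完成後.txt
```

Unlike the codec chain, this leaves other backslash sequences and existing non-Latin-1 characters untouched.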