Python 3 - String with \xHH Hex Values to Unicode

I am trying to convert a string with characters that require multiple hex values like this:

'Mahou Shoujo Madoka\xe2\x98\x85Magica'

to its unicode representation:

'Mahou Shoujo Madoka★Magica'

When I print the string, it tries to evaluate each hex value separately, so by default I get this:

x = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
print(x)

Mahou Shoujo MadokaâMagica

So I tried the approaches from some other StackOverflow answers, such as "Best way to convert string to bytes in Python 3?":

x = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
z = x.encode('utf-8')
print('z:', z)
y = z.decode('utf-8')
print('y:', y)

z: b'Mahou Shoujo Madoka\xc3\xa2\xc2\x98\xc2\x85Magica'
y: Mahou Shoujo MadokaâMagica

And "Python: Convert Unicode-Hex-String to Unicode":

import binascii

z = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
x = binascii.unhexlify(binascii.hexlify(z.encode('utf-8'))).decode('utf-8')
print('x:', x)

x: Mahou Shoujo MadokaâMagica

I tried some others too, but none of them worked. Most of the results I found were from people with a double backslash problem, which is not my problem.

What I notice is that str.encode seems to add extra bytes to the output (compare z in the first attempt with the original string x), and I'm not sure why.

So I tried typing the characters of the string into a bytes literal by hand:

x = b'Mahou Shoujo Madoka\xe2\x98\x85Magica'
x.decode('utf-8')

'Mahou Shoujo Madoka★Magica'

and it worked. But I couldn't find a way to convert a string to bytes literally, byte for byte, other than typing it out. Where am I going wrong?

asked Mar 14 '17 05:03 by user14678939


1 Answer

In Python 3 your original string is a Unicode string, but the code points \xe2, \x98 and \x85 in it are really the three UTF-8 bytes of a single character that were decoded with the wrong codec, one byte per code point. To fix it:

>>> s = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
>>> type(s)
<class 'str'>
>>> s.encode('latin1')
b'Mahou Shoujo Madoka\xe2\x98\x85Magica'
>>> s.encode('latin1').decode('utf8')
'Mahou Shoujo Madoka★Magica'

The latin1 encoding maps 1:1 to the first 256 code points in Unicode, so .encode('latin1') translates the code points directly back to the bytes with the same values. Then you can .decode('utf8') the bytes properly.
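As a rough sketch of the same round trip, the fix can be wrapped in a small helper. The name fix_mojibake is hypothetical, not a standard library function, and the sketch assumes every character in the input is at or below U+00FF; anything higher makes .encode('latin1') raise UnicodeEncodeError. It also shows where the extra bytes in the question's first attempt come from: UTF-8 encodes each code point from U+0080 to U+00FF as two bytes, while latin1 encodes it as one.

```python
def fix_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8 bytes that were decoded as latin1.

    Raises UnicodeEncodeError if text has code points above U+00FF, and
    UnicodeDecodeError if the recovered bytes are not valid UTF-8.
    """
    return text.encode('latin1').decode('utf8')


s = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'

# latin1 maps every byte value 0..255 to the code point with the same number.
assert all(bytes([i]).decode('latin1') == chr(i) for i in range(256))

# The same code point encodes differently under the two codecs.
print('\xe2'.encode('utf-8'))   # b'\xc3\xa2' (two bytes)
print('\xe2'.encode('latin1'))  # b'\xe2'     (one byte)

fixed = fix_mojibake(s)
# The three mis-decoded code points collapse into one character, U+2605.
print(len(s), len(fixed))       # 28 26
```

If the input might already be correct text, catching UnicodeEncodeError and UnicodeDecodeError around the call and returning the original string is one cautious option.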

answered Sep 27 '22 22:09 by Mark Tolonen