I am trying to convert a string with characters that require multiple hex values like this:
'Mahou Shoujo Madoka\xe2\x98\x85Magica'
to its Unicode representation:
'Mahou Shoujo Madoka★Magica'
When I print the string, it tries to evaluate each hex value separately, so by default I get this:
x = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
print(x)
Mahou Shoujo MadokaâMagica
So I tried some other Stack Overflow answers, such as "Best way to convert string to bytes in Python 3?":
x = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
z = x.encode('utf-8')
print('z:', z)
y = z.decode('utf-8')
print('y:', y)
z: b'Mahou Shoujo Madoka\xc3\xa2\xc2\x98\xc2\x85Magica'
y: Mahou Shoujo MadokaâMagica
And "Python: Convert Unicode-Hex-String to Unicode":
import binascii
z = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
x = binascii.unhexlify(binascii.hexlify(z.encode('utf-8'))).decode('utf-8')
print('x:', x)
x: Mahou Shoujo MadokaâMagica
And some others, but none of them worked. Most of the results I found were people who had a double backslash problem, but none of them had my exact problem.
What I notice is that when I call str.encode, it seems to add some extra bytes to the result (compare the bytes in z with the escapes in x in the first attempt), and I'm not quite sure why.
So I tried manually typing the characters of the string into a bytes literal:
x = b'Mahou Shoujo Madoka\xe2\x98\x85Magica'
x.decode('utf-8')
'Mahou Shoujo Madoka★Magica'
and it worked. But I couldn't find a way to convert a string to bytes literally, other than typing it out. Where am I going wrong?
In Python 3 your original string is a Unicode string (str), but it contains code points that are really the bytes of UTF-8 text that was decoded incorrectly. To fix it:
>>> s = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'
>>> type(s)
<class 'str'>
>>> s.encode('latin1')
b'Mahou Shoujo Madoka\xe2\x98\x85Magica'
>>> s.encode('latin1').decode('utf8')
'Mahou Shoujo Madoka★Magica'
The latin1 encoding happens to map 1:1 to the first 256 code points in Unicode, so .encode('latin1') translates the code points directly back to bytes. Then you can .decode('utf8') the bytes properly.
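As a rough sketch of the explanation above, the snippet below wraps the round trip in a small helper and also shows where the "extra" bytes in the question's first attempt came from: each of the code points U+00E2, U+0098 and U+0085 is above U+007F, so UTF-8 encodes each of them as two bytes. The name fix_mojibake is made up for illustration and is not a standard library function. The helper assumes every character in the input is below U+0100; otherwise .encode('latin1') raises UnicodeEncodeError.

```python
def fix_mojibake(text: str) -> str:
    """Re-interpret a str whose code points are really UTF-8 byte values.

    Hypothetical helper: assumes every character is below U+0100,
    otherwise .encode('latin1') raises UnicodeEncodeError.
    """
    return text.encode('latin1').decode('utf8')


s = 'Mahou Shoujo Madoka\xe2\x98\x85Magica'

# latin1 maps code points 0-255 straight to the same byte values.
assert s.encode('latin1') == b'Mahou Shoujo Madoka\xe2\x98\x85Magica'

# utf-8 encodes each code point above U+007F as two bytes, which is
# where the extra \xc3 and \xc2 bytes in the question came from.
assert '\xe2'.encode('utf-8') == b'\xc3\xa2'
assert '\x98'.encode('utf-8') == b'\xc2\x98'
assert '\x85'.encode('utf-8') == b'\xc2\x85'

fixed = fix_mojibake(s)
# U+2605 is the star character from the expected title.
assert fixed == 'Mahou Shoujo Madoka\u2605Magica'
```

This only helps when the string really is UTF-8 that was read one byte per code point. Text that was mis-decoded with a different codec (cp1252, for example) would need that codec in place of latin1.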