I have a UTF-8 encoded string that comes from somewhere else that contains the characters \xc3\x85lesund
(literal backslash, literal "x", literal "c", etc).
Printing it outputs the following:
\xc3\x85lesund
I want to convert it to a bytes variable:
b'\xc3\x85lesund'
So that I can decode it to:
'Ålesund'
How can I do this? I'm using Python 3.4.
unicode_escape
TL;DR You can decode bytes using the unicode_escape encoding to convert \xXX and \uXXXX escape sequences to the corresponding characters:
>>> r'\xc3\x85lesund'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85lesund'
First, encode the string to bytes so it can be decoded:
>>> r'\xc3\x85あ'.encode('utf-8')
b'\\xc3\\x85\xe3\x81\x82'
(I changed the string to show that this process works even for characters outside of Latin-1.)
Here's how each character is encoded (note that あ is encoded into multiple bytes):
\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
c (U+0063) -> 0x63
3 (U+0033) -> 0x33
\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
8 (U+0038) -> 0x38
5 (U+0035) -> 0x35
あ (U+3042) -> 0xe3, 0x81, 0x82
Next, decode the bytes as unicode_escape to replace each escape sequence with its corresponding character:
>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')
'Ã\x85ã\x81\x82'
Each escape sequence is converted to a separate character; each byte that is not part of an escape sequence is converted to the character with the corresponding ordinal value:
\\xc3 -> U+00C3
\\x85 -> U+0085
\xe3 -> U+00E3
\x81 -> U+0081
\x82 -> U+0082
Finally, encode the string to bytes again:
>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85\xe3\x81\x82'
Encoding as Latin-1 simply converts each character to its ordinal value:
U+00C3 -> 0xc3
U+0085 -> 0x85
U+00E3 -> 0xe3
U+0081 -> 0x81
U+0082 -> 0x82
And voilà, we have the byte sequence you're looking for.
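The three steps above can be wrapped in a small helper. This is only a sketch: the function name unescape_to_bytes is made up for illustration, and it assumes the input contains no escapes that produce characters above U+00FF.

```python
def unescape_to_bytes(text: str) -> bytes:
    """Turn literal \\xXX escape sequences in `text` into the bytes they name.

    Hypothetical helper name; it only chains the three steps described above.
    """
    as_bytes = text.encode('utf-8')                 # str -> bytes, so it can be decoded
    unescaped = as_bytes.decode('unicode_escape')   # escapes and raw bytes -> one char each
    return unescaped.encode('latin-1')              # each char -> its ordinal as one byte


raw = unescape_to_bytes(r'\xc3\x85lesund')
place = raw.decode('utf-8')   # the step the question was ultimately after
print(raw)
print(place)
```

Note that the final Latin-1 step raises UnicodeEncodeError if the input contains an escape such as \u3042, because the resulting character has no single-byte Latin-1 value.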
codecs.escape_decode
As an alternative, you can use the codecs.escape_decode function to interpret escape sequences in a bytes-to-bytes conversion, as user19087 posted in an answer to a similar question:
>>> import codecs
>>> codecs.escape_decode(r'\xc3\x85lesund'.encode('utf-8'))[0]
b'\xc3\x85lesund'
However, codecs.escape_decode is undocumented, so I wouldn't recommend using it.
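If you want a bytes-to-bytes conversion without relying on an undocumented function, one option is a regular expression substitution. Note that this is a different technique from the two shown above, not an equivalent of codecs.escape_decode: the sketch below handles only \xXX escapes (not \n, \\, \uXXXX and so on), and the names _HEX_ESCAPE and replace_hex_escapes are made up for illustration.

```python
import re

# Matches a literal backslash, "x", and exactly two hex digits (bytes pattern).
_HEX_ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')


def replace_hex_escapes(raw: bytes) -> bytes:
    """Replace each literal \\xXX sequence with the single byte it names.

    Hypothetical helper; all other bytes are passed through unchanged.
    """
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


data = replace_hex_escapes(r'\xc3\x85lesund'.encode('utf-8'))
print(data)
print(data.decode('utf-8'))
```

Because only \xXX is recognised, input that mixes in other escape types would need extra handling.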