
How can I convert literal escape sequences in a string to the corresponding bytes? [duplicate]

I have a string from an external source that contains the characters \xc3\x85lesund (literal backslash, literal "x", literal "c", etc.), which spell out UTF-8 encoded text as escape sequences.

Printing it outputs the following:

\xc3\x85lesund

I want to convert it to a bytes variable:

b'\xc3\x85lesund'

So that I can decode it to:

'Ålesund'

How can I do this? I'm using Python 3.4.

Rafael Almeida asked Jan 09 '17 16:01



1 Answer

Using unicode_escape

TL;DR You can decode bytes using the unicode_escape encoding to convert \xXX and \uXXXX escape sequences to the corresponding characters:

>>> r'\xc3\x85lesund'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85lesund'

First, encode the string to bytes so it can be decoded:

>>> r'\xc3\x85あ'.encode('utf-8')
b'\\xc3\\x85\xe3\x81\x82'

(I changed the string to show that this process works even for characters outside of Latin-1.)

Here's how each character is encoded (note that あ is encoded into multiple bytes):

  • \ (U+005C) -> 0x5c
  • x (U+0078) -> 0x78
  • c (U+0063) -> 0x63
  • 3 (U+0033) -> 0x33
  • \ (U+005C) -> 0x5c
  • x (U+0078) -> 0x78
  • 8 (U+0038) -> 0x38
  • 5 (U+0035) -> 0x35
  • あ (U+3042) -> 0xe3, 0x81, 0x82
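
The per-character mapping above can be reproduced with a short sketch. The variable names are only illustrative, and str.format is used instead of f-strings so that it should also run on Python 3.4:

```python
# Illustrative sketch: list each character of the string together with
# its code point and the UTF-8 bytes it encodes to.
text = r'\xc3\x85あ'

rows = [(ch, 'U+{:04X}'.format(ord(ch)), ch.encode('utf-8')) for ch in text]

for ch, code_point, encoded in rows:
    # Iterating over a bytes object yields integers in Python 3.
    print(ch, code_point, '->', ', '.join(hex(b) for b in encoded))
```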

Next, decode the bytes as unicode_escape to replace each escape sequence with its corresponding character:

>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')
'Ã\x85ã\x81\x82'

Each escape sequence is converted to a separate character; each byte that is not part of an escape sequence is converted to the character with the corresponding ordinal value:

  • \\xc3 -> U+00C3
  • \\x85 -> U+0085
  • \xe3 -> U+00E3
  • \x81 -> U+0081
  • \x82 -> U+0082

Finally, encode the string to bytes again:

>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85\xe3\x81\x82'

Encoding as Latin-1 simply converts each character to its ordinal value:

  • U+00C3 -> 0xc3
  • U+0085 -> 0x85
  • U+00E3 -> 0xe3
  • U+0081 -> 0x81
  • U+0082 -> 0x82
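
To check the claim that Latin-1 maps each character to the byte with the same ordinal value, here is a small sketch covering the whole range U+0000 to U+00FF (the variable names are made up for illustration):

```python
# Illustrative sketch: Latin-1 maps code point n to the single byte n
# for every n from 0 to 255.
every_char = ''.join(chr(n) for n in range(256))
latin1_bytes = every_char.encode('latin-1')

print(latin1_bytes == bytes(range(256)))

# The intermediate string from the previous step, encoded the same way.
intermediate = 'Ã\x85ã\x81\x82'
print([hex(ord(ch)) for ch in intermediate])
print(intermediate.encode('latin-1'))
```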

And voilà, we have the byte sequence you're looking for.
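
Putting the three steps together, a minimal sketch of a helper might look like this. The name unescape_to_bytes is hypothetical, not something from the standard library, and the final decode assumes the escaped bytes really are UTF-8, as in the question:

```python
def unescape_to_bytes(text):
    """Turn literal \\xXX escape sequences in a str into the bytes they name.

    Hypothetical helper: it just chains the three steps described above.
    """
    escaped = text.encode('utf-8')                # str -> bytes, backslashes kept literally
    unescaped = escaped.decode('unicode_escape')  # each escape -> one character
    return unescaped.encode('latin-1')            # each character -> the byte with the same ordinal


raw = unescape_to_bytes(r'\xc3\x85lesund')
print(raw)                  # b'\xc3\x85lesund'
print(raw.decode('utf-8'))  # Ålesund
```

One caveat with this approach: if the input contains a \uXXXX escape for a character above U+00FF, the final encode('latin-1') step raises UnicodeEncodeError, because Latin-1 only covers U+0000 to U+00FF.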

Using codecs.escape_decode

As an alternative, you can use the codecs.escape_decode function to interpret escape sequences in a bytes-to-bytes conversion, as user19087 posted in an answer to a similar question. It returns a tuple, and the first element holds the decoded bytes:

>>> import codecs
>>> codecs.escape_decode(r'\xc3\x85lesund'.encode('utf-8'))[0]
b'\xc3\x85lesund'

However, codecs.escape_decode is undocumented, so I wouldn't recommend using it.
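
As a quick sanity check, the sketch below compares the two approaches on the same input. It assumes a CPython interpreter where codecs.escape_decode is available; since that function is undocumented, its presence and behavior are not guaranteed elsewhere:

```python
import codecs

text = r'\xc3\x85あ'

# Documented route: unicode_escape followed by a Latin-1 encode.
via_unicode_escape = text.encode('utf-8').decode('unicode_escape').encode('latin-1')

# Undocumented route (assumed to exist, as on CPython): bytes in, (bytes, length) tuple out.
via_escape_decode = codecs.escape_decode(text.encode('utf-8'))[0]

print(via_unicode_escape == via_escape_decode)
```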

ThisSuitIsBlackNot answered Oct 08 '22 07:10