How can I convert literal escape sequences in a string to the corresponding bytes? [duplicate]

Tags:

python

python-3.x

encoding

I have a UTF-8 encoded string that comes from somewhere else that contains the characters \xc3\x85lesund (literal backslash, literal "x", literal "c", etc).

Printing it outputs the following:

\xc3\x85lesund

I want to convert it to a bytes variable:

b'\xc3\x85lesund'

so that I can decode it to:

'Ålesund'

How can I do this? I'm using Python 3.4.

376

asked Jan 09 '17 16:01

Rafael Almeida

1 Answer

Using `unicode_escape`

TL;DR: You can decode bytes with the unicode_escape codec to convert \xXX and \uXXXX escape sequences to the corresponding characters:

>>> r'\xc3\x85lesund'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85lesund'

First, encode the string to bytes so it can be decoded:

>>> r'\xc3\x85あ'.encode('utf-8')
b'\\xc3\\x85\xe3\x81\x82'

(I changed the string to show that this process works even for characters outside of Latin-1.)

Here's how each character is encoded (note that あ is encoded into multiple bytes):

\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
c (U+0063) -> 0x63
3 (U+0033) -> 0x33
\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
8 (U+0038) -> 0x38
5 (U+0035) -> 0x35
あ (U+3042) -> 0xe3, 0x81, 0x82
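
The table above can be reproduced with a short loop. This is only an illustrative sketch; the names `text` and `per_char` are made up for the example, and `str.format` is used so that it also runs on Python 3.4.

```python
# Encode each character of the sample string separately to see
# which UTF-8 bytes it contributes.
text = r'\xc3\x85あ'

per_char = [
    (ch, 'U+{:04X}'.format(ord(ch)), [hex(b) for b in ch.encode('utf-8')])
    for ch in text
]

for ch, code_point, byte_values in per_char:
    print(ch, code_point, '->', ', '.join(byte_values))
```

Every ASCII character maps to a single byte, while あ takes three.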

Next, decode the bytes as unicode_escape to replace each escape sequence with its corresponding character:

>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')
'Ã\x85ã\x81\x82'

Each escape sequence is converted to a separate character; each byte that is not part of an escape sequence is converted to the character with the corresponding ordinal value:

\\xc3 -> U+00C3
\\x85 -> U+0085
\xe3 -> U+00E3
\x81 -> U+0081
\x82 -> U+0082
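
To check the mapping above, you can list the code point of each character in the decoded string. Again this is just a sketch, with `decoded` and `code_points` as illustrative names.

```python
# Decode the UTF-8 bytes as unicode_escape and inspect the result:
# two characters come from the escape sequences, three from the raw bytes of あ.
decoded = r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')

code_points = ['U+{:04X}'.format(ord(ch)) for ch in decoded]
print(code_points)
```

Note that the eight bytes of the two escape sequences collapse into two characters, so the decoded string is five characters long.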

Finally, encode the string to bytes again:

>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85\xe3\x81\x82'

Encoding as Latin-1 simply converts each character to its ordinal value:

U+00C3 -> 0xc3
U+0085 -> 0x85
U+00E3 -> 0xe3
U+0081 -> 0x81
U+0082 -> 0x82
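
The reason Latin-1 works for this step is that it maps code points U+0000 through U+00FF to the byte with the same value, one to one. A quick way to convince yourself (the variable names are only for illustration):

```python
# Latin-1 round trip over every possible byte value.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin-1')

# Each character's ordinal equals the byte it came from...
ordinals_match = [ord(ch) for ch in as_text] == list(range(256))
# ...and encoding gives back exactly the original bytes.
round_trips = as_text.encode('latin-1') == all_bytes

print(ordinals_match, round_trips)
```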

And voilà, we have the byte sequence you're looking for.
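
Putting the three steps together, and adding the final decode that the question is after, might look like the sketch below. The helper name `unescape_to_bytes` is hypothetical, not something from the standard library.

```python
def unescape_to_bytes(text):
    """Turn literal escape sequences such as \\xc3 into the bytes they name.

    Hypothetical helper: it simply chains the three steps described above.
    """
    return text.encode('utf-8').decode('unicode_escape').encode('latin-1')


raw = r'\xc3\x85lesund'          # literal backslashes, as in the question
data = unescape_to_bytes(raw)    # the bytes the escapes describe
name = data.decode('utf-8')      # the text those bytes encode

print(data)
print(name)
```

One caveat with this approach: if the input contains a \uXXXX escape for a character above U+00FF (for example `r'\u3042'`), the unicode_escape step produces that character directly, and the final Latin-1 encode raises UnicodeEncodeError. The helper is only suitable when the escapes describe raw byte values.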

Using `codecs.escape_decode`

As an alternative, you can use the codecs.escape_decode function to interpret escape sequences in a bytes-to-bytes conversion, as user19087 posted in an answer to a similar question:

>>> import codecs
>>> codecs.escape_decode(r'\xc3\x85lesund'.encode('utf-8'))[0]
b'\xc3\x85lesund'

However, codecs.escape_decode is undocumented, so I wouldn't recommend using it.
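
For completeness, here is a small sketch of why the `[0]` is needed: the function returns a tuple, and the converted bytes are its first element. Because the function is undocumented, treat this behavior as an observation about current CPython rather than a guarantee.

```python
import codecs

# escape_decode returns a tuple; the first item is the converted bytes.
result = codecs.escape_decode(r'\xc3\x85lesund'.encode('utf-8'))
data = result[0]

# The bytes can then be decoded as UTF-8 to get the intended text.
name = data.decode('utf-8')

print(data)
print(name)
```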

120

answered Oct 08 '22 07:10

ThisSuitIsBlackNot
