Evaluate UTF-8 literal escape sequences in a string in Python3

I have a string of the form:

s = '\\xe2\\x99\\xac'

I would like to convert this to the character ♬ by evaluating the escape sequence. However, everything I've tried either results in an error or prints out garbage. How can I force Python to convert the escape sequence into a literal unicode character?

What I've read elsewhere suggests that the following line of code should do what I want, but it results in a UnicodeEncodeError.

print(bytes(s, 'utf-8').decode('unicode-escape'))

I also tried the following, which has the same result:

import codecs
print(codecs.getdecoder('unicode_escape')(s)[0])

Both of these approaches produce the string 'â\x99¬', which print is subsequently unable to handle.

In case it makes any difference, the string is read from a UTF-8 encoded file and will ultimately be written to a different UTF-8 encoded file after processing.

Altay_H asked Oct 11 '14 05:10


1 Answer

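.decode('unicode-escape') gives you the string '\xe2\x99\xac'. That is three separate characters (U+00E2, U+0099, U+00AC), not the single character whose UTF-8 encoding is those three bytes.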

>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape')
'â\x99¬'
>>> _ == '\xe2\x99\xac'
True

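You still need to decode those three code points as UTF-8. To do that, first encode the string with latin1 (also known as iso-8859-1), which maps each code point from 0 to 255 to the byte with the same value and so preserves the bytes. Then decode the resulting bytes as UTF-8.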

>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
'♬'
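
The same chain can be wrapped in a small helper. This is only a sketch: the name decode_utf8_escapes is hypothetical, and it assumes the escapes in the input are \xNN byte escapes that together form valid UTF-8. An escape such as \u266c would yield a code point above 255 and make the latin1 step raise UnicodeEncodeError.

```python
def decode_utf8_escapes(text: str) -> str:
    """Turn literal \\xNN escapes that spell out UTF-8 bytes into real characters.

    Hypothetical helper; assumes only \\xNN-style escapes appear in `text`.
    """
    # Step 1: interpret the backslash escapes. Each \xNN becomes the single
    # code point U+00NN, so the result is still "garbage" like 'â\x99¬'.
    code_points = text.encode("utf-8").decode("unicode_escape")
    # Step 2: latin1 maps code points 0-255 to the identical byte values,
    # which recovers the raw UTF-8 byte sequence.
    raw_bytes = code_points.encode("latin1")
    # Step 3: decode those bytes as the UTF-8 they were meant to be.
    return raw_bytes.decode("utf-8")
```

Calling decode_utf8_escapes(s) with the s from the question should return '♬'. Because unicode_escape treats unescaped bytes as Latin-1, characters that are already non-ASCII in the input appear to survive the round trip unchanged.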

falsetru answered Sep 30 '22 15:09