Process escape sequences in a string in Python

Tags:

python

string

escaping

Sometimes when I get input from a file or the user, I get a string with escape sequences in it. I would like to process the escape sequences in the same way that Python processes escape sequences in string literals.

For example, let's say myString is defined as:

    >>> myString = "spam\\neggs"
    >>> print(myString)
    spam\neggs

I want a function (I'll call it process) that does this:

    >>> print(process(myString))
    spam
    eggs

It's important that the function can process all of the escape sequences in Python (listed in a table in the link above).

Does Python have a function to do this?

asked Oct 26 '10 03:10

dln385

2 Answers

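The correct thing to do is use the 'string_escape' codec to decode the string. That codec exists only on Python 2; on Python 3, encode the string to bytes and decode it with 'unicode_escape' instead.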

    >>> myString = "spam\\neggs"
    >>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3
    >>> decoded_string = myString.decode('string_escape') # python2
    >>> print(decoded_string)
    spam
    eggs

Don't use the AST or eval. Using the string codecs is much safer.
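
As a minimal sketch, the Python 3 line above could be wrapped in the `process` function the question asks for. The function body and its ASCII-only restriction are assumptions made for illustration, not part of the original answer; encoding with `ascii` makes non-ASCII input fail loudly instead of being decoded incorrectly.

```python
def process(text):
    """Interpret Python-style backslash escapes in ASCII-only text.

    Hypothetical helper: text.encode("ascii") raises UnicodeEncodeError
    on non-ASCII input rather than silently mis-decoding it.
    """
    return text.encode("ascii").decode("unicode_escape")


my_string = "spam\\neggs"
print(process(my_string))
```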

answered Oct 13 '22 19:10

Jerub

`unicode_escape` doesn't work in general

It turns out that the string_escape or unicode_escape solution does not work in general -- particularly, it doesn't work in the presence of actual Unicode.

If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.
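
A small sketch of the safe case described above, where every non-ASCII character is already written as an escape. The sample string and variable names are made up for illustration.

```python
# Every non-ASCII character is spelled as an escape, so the text is pure ASCII.
escaped = "na\\xefve \\t test \\u0151"

# Encoding to ASCII cannot fail here, and unicode_escape then decodes
# each escape sequence to the character it names.
decoded = escaped.encode("ascii").decode("unicode_escape")
print(decoded)
```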

unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places -- for example, Python source code -- the source data is already Unicode text.

The only way this can work correctly is if you encode the text into bytes first. UTF-8 is the sensible encoding for all text, so that should work, right?

The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.

    >>> s = 'naïve \\t test'
    >>> print(s.encode('utf-8').decode('unicode_escape'))
    naÃ¯ve   test

Well, that's wrong.

The new recommended way to use codecs that decode text into text is to call codecs.decode directly. Does that help?

    >>> import codecs
    >>> print(codecs.decode(s, 'unicode_escape'))
    naÃ¯ve   test

Not at all. (Also, the above is a UnicodeError on Python 2.)

The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:

    >>> print(s.encode('latin-1').decode('unicode_escape'))
    naïve    test

But that's terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!

    >>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151' in position 3: ordinal not in range(256)
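
The Latin-1 assumption can be checked directly at the byte level. This is a small sketch with illustrative variable names: it compares what `unicode_escape` and `latin-1` produce for the UTF-8 bytes of a single character.

```python
# 'ï' (U+00EF) is two bytes in UTF-8: 0xC3 0xAF.
utf8_bytes = "ï".encode("utf-8")
assert utf8_bytes == b"\xc3\xaf"

# unicode_escape maps each non-ASCII byte to the code point with the same
# value, which is what Latin-1 decoding does as well.
via_escape = utf8_bytes.decode("unicode_escape")
via_latin1 = utf8_bytes.decode("latin-1")

print(via_escape == via_latin1)
print(len(via_escape))  # two characters instead of the original one
```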

Adding a regular expression to solve the problem

(Surprisingly, we do not now have two problems.)

What we need to do is apply the unicode_escape decoder only to text that we are certain is ASCII. In particular, we can make sure to apply it only to valid Python escape sequences, which are guaranteed to be ASCII text.

The plan is, we'll find escape sequences using a regular expression, and use a function as the argument to re.sub to replace them with their unescaped value.
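
To illustrate the mechanism before the full version, here is a minimal sketch of passing a function to `re.sub`. The two-entry lookup table and the names `SIMPLE_ESCAPES` and `replace_simple` are hypothetical and cover only `\n` and `\t`.

```python
import re

# Hypothetical lookup table covering just two escapes, for illustration only.
# Keys are two-character strings: a backslash followed by a letter.
SIMPLE_ESCAPES = {r"\n": "\n", r"\t": "\t"}


def replace_simple(match):
    # re.sub calls this once per match and substitutes the return value.
    return SIMPLE_ESCAPES[match.group(0)]


result = re.sub(r"\\[nt]", replace_simple, "Ernő \\t Rubik")
print(result)
```

Because only the matched escape is touched, the literal `ő` passes through unchanged.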

    import re
    import codecs

    ESCAPE_SEQUENCE_RE = re.compile(r'''
        ( \\U........      # 8-digit hex escapes
        | \\u....          # 4-digit hex escapes
        | \\x..            # 2-digit hex escapes
        | \\[0-7]{1,3}     # Octal escapes
        | \\N\{[^}]+\}     # Unicode characters by name
        | \\[\\'"abfnrtv]  # Single-character escapes
        )''', re.UNICODE | re.VERBOSE)

    def decode_escapes(s):
        def decode_match(match):
            return codecs.decode(match.group(0), 'unicode-escape')

        return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

And with that:

    >>> print(decode_escapes('Ernő \\t Rubik'))
    Ernő     Rubik
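
A few more inputs can be tried the same way. The sketch below repeats the pattern and function from this answer only so that it runs standalone; the sample strings are made up for illustration.

```python
import codecs
import re

# Same pattern and function as in the answer, repeated so this runs standalone.
ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)


def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)


samples = [
    'caf\\xe9',                    # 2-digit hex escape
    '\\u0151 and literal ő',       # escape next to a literal non-ASCII character
    '\\N{BULLET} item',            # character by name
    'unknown \\q stays as it is',  # not matched by the pattern, so left alone
]
for sample in samples:
    print(repr(decode_escapes(sample)))
```

Sequences the pattern does not match, such as `\q`, are passed through untouched rather than raising an error.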

answered Oct 13 '22 17:10

rspeer
