Python "string_escape" vs "unicode_escape"

According to the docs, the builtin string encoding string_escape:

Produce[s] a string that is suitable as string literal in Python source code

...while the unicode_escape:

Produce[s] a string that is suitable as Unicode literal in Python source code

So, they should have roughly the same behaviour. BUT, they appear to treat single quotes differently:

>>> print """before '" \0 after""".encode('string-escape')
before \'" \x00 after
>>> print """before '" \0 after""".encode('unicode-escape')
before '" \x00 after

The string_escape escapes the single quote while the Unicode one does not. Is it safe to assume that I can simply:

>>> escaped = my_string.encode('unicode-escape').replace("'", "\\'") 

...and get the expected behaviour?

Edit: Just to be super clear, the expected behavior is getting something suitable as a literal.

Mike Boers asked Jun 03 '10 19:06


People also ask

What is unicode_escape in Python?

The `unicode_escape` encoding is not about escaping Unicode characters; it is about Python source code. The docs define it as: "Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped."


2 Answers

According to my reading of the implementation of unicode-escape and the unicode repr in the CPython 2.6.5 source, yes: the only difference between repr(unicode_string) and unicode_string.encode('unicode-escape') is that repr adds the wrapping quotes and escapes whichever quote character it used for them.

They are both driven by the same function, unicodeescape_string. It takes a parameter whose sole purpose is to toggle the addition of the wrapping quotes and the escaping of that quote character.
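The same relationship can be sketched on Python 3, with the caveat that this is an analogue rather than the 2.6.5 code path: ascii() stands in for Python 2's repr(unicode_string) (Python 3's repr leaves printable non-ASCII characters alone), and escaped_body is a hypothetical helper name used only for illustration.

```python
# Python 3 sketch; ascii() is assumed here as the closest stand-in for
# Python 2's repr(unicode_string). escaped_body is a hypothetical helper.
def escaped_body(s):
    """Return the unicode_escape form of s as text, without wrapping quotes."""
    return s.encode("unicode_escape").decode("ascii")

plain = "tab\there \x00 \u20ac"
both_quotes = "before '\" after"

# No quote characters in the string: stripping the wrapping quotes that
# ascii() adds leaves the same text as the codec produces.
same_without_quotes = ascii(plain)[1:-1] == escaped_body(plain)

# Both quote kinds present: ascii() wraps in single quotes and escapes them,
# while unicode_escape leaves both quote characters untouched.
repr_body = ascii(both_quotes)[1:-1]
codec_body = escaped_body(both_quotes)

print(same_without_quotes)
print(repr_body)
print(codec_body)
```

The two bodies differ only in the backslash before the single quote, which is the quote character ascii() chose for wrapping.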

Mike Boers answered Oct 09 '22 13:10


Within the range 0 ≤ c < 128, yes, the single quote (') is the only difference in CPython 2.6.

>>> set(unichr(c).encode('unicode_escape') for c in range(128)) - set(chr(c).encode('string_escape') for c in range(128))
set(["'"])

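Outside this range the two encodings are not interchangeable: string_escape only accepts str (byte strings), while unicode_escape needs a unicode object, so a non-ASCII str fails the implicit ASCII decode.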

>>> '\x80'.encode('string_escape')
'\\x80'
>>> '\x80'.encode('unicode_escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

>>> u'1'.encode('unicode_escape')
'1'
>>> u'1'.encode('string_escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: escape_encode() argument 1 must be str, not unicode

On Python 3.x, the string_escape encoding no longer exists, since str can only store Unicode.
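For Python 3, the approach from the question might look like the sketch below. It assumes unicode_escape still escapes backslashes and leaves quotes alone, and it checks the result by round-tripping through ast.literal_eval. The helper name to_single_quoted_literal and the sample string are made up for illustration.

```python
import ast

def to_single_quoted_literal(text):
    """Hypothetical helper: build a single-quoted literal for text.

    unicode_escape returns bytes on Python 3, so the result is decoded
    back to ASCII text before the single quotes are escaped.
    """
    body = text.encode("unicode_escape").decode("ascii")
    return "'" + body.replace("'", "\\'") + "'"

sample = "before '\" \0 after \\ \u20ac\n"
literal = to_single_quoted_literal(sample)

# Evaluating the literal should give back the original string.
round_trip = ast.literal_eval(literal)
print(literal)
print(round_trip == sample)
```

Because the codec has already doubled every backslash in the input, the later replace() cannot collide with an existing backslash before a quote.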

kennytm answered Oct 09 '22 13:10