
How do I .decode('string-escape') in Python 3?

I have some escaped strings that need to be unescaped. I'd like to do this in Python.

For example, in Python 2.7 I can do this:

>>> "\\123omething special".decode('string-escape')
'Something special'

How do I do it in Python 3? This doesn't work:

>>> b"\\123omething special".decode('string-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: string-escape

My goal is to be able to take a string like this:

s\000u\000p\000p\000o\000r\000t\000@\000p\000s\000i\000l\000o\000c\000.\000c\000o\000m\000 

And turn it into:

"support@psiloc.com"

After I do the conversion, I'll probe to see if the string I have is encoded in UTF-8 or UTF-16.

vy32 asked Feb 11 '13 20:02



2 Answers

You'll have to use unicode_escape instead:

>>> b"\\123omething special".decode('unicode_escape')
'Something special'

If you start with a str object instead (the equivalent of Python 2.7's unicode), you'll need to encode it to bytes first, then decode with unicode_escape.
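A minimal sketch of the str case follows. The helper name unescape_str is made up for illustration, and the choice of latin1 for the initial encode is an assumption: it works for text whose characters all fall in the range U+0000 to U+00FF, because unicode_escape treats non-escaped bytes as Latin-1.

```python
def unescape_str(text: str) -> str:
    """Interpret backslash escape sequences contained in a str."""
    # unicode_escape only decodes bytes, so encode the str first.
    # latin1 keeps code points 0-255 as single bytes; any character
    # above U+00FF raises UnicodeEncodeError at this step.
    return text.encode("latin1").decode("unicode_escape")


print(unescape_str(r"\123omething special"))  # Something special
```

Text containing characters above U+00FF would need a different approach, since the latin1 step cannot represent them.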

If you need bytes as the end result, you'll have to encode again with a suitable encoding. Use .encode('latin1'), for example, if you need to preserve literal byte values; the first 256 Unicode code points map one-to-one onto the byte values 0-255.

Your example is actually UTF-16 data with escapes. Decode with unicode_escape, encode back to latin1 to preserve the bytes, then decode as utf-16-le (UTF-16 little-endian without a BOM):

>>> value = b's\\000u\\000p\\000p\\000o\\000r\\000t\\000@\\000p\\000s\\000i\\000l\\000o\\000c\\000.\\000c\\000o\\000m\\000'
>>> value.decode('unicode_escape').encode('latin1')  # convert to bytes
b's\x00u\x00p\x00p\x00o\x00r\x00t\x00@\x00p\x00s\x00i\x00l\x00o\x00c\x00.\x00c\x00o\x00m\x00'
>>> _.decode('utf-16-le')  # decode from UTF-16-LE
'support@psiloc.com'
Martijn Pieters answered Oct 10 '22 10:10
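The three steps can be wrapped in one function, sketched here under the assumption that the payload encoding is known in advance. The name decode_escaped and its encoding parameter are illustrative, not part of any library.

```python
def decode_escaped(raw: bytes, encoding: str = "utf-16-le") -> str:
    """Unescape backslash sequences in raw, then decode the payload."""
    # Step 1: interpret the escapes (\000 becomes U+0000, and so on).
    text = raw.decode("unicode_escape")
    # Step 2: latin1 maps code points 0-255 back to identical byte values.
    data = text.encode("latin1")
    # Step 3: decode the recovered bytes with the real encoding.
    return data.decode(encoding)


print(decode_escaped(b"h\\000i\\000"))            # hi
print(decode_escaped(b"caf\\303\\251", "utf-8"))  # café
```

Step 2 raises UnicodeEncodeError if the input contains \u or \U escapes for code points above U+00FF, so this sketch only suits input whose escapes describe raw byte values.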


The old "string-escape" codec maps bytestrings to bytestrings, and there's been a lot of debate about what to do with such codecs, so it isn't currently available through the standard encode/decode interfaces.

However, the code is still there in the C API (as PyBytes_En/DecodeEscape), and it is still exposed to Python via the undocumented codecs.escape_encode and codecs.escape_decode.

>>> import codecs
>>> codecs.escape_decode(b"ab\\xff")
(b'ab\xff', 6)
>>> codecs.escape_encode(b"ab\xff")
(b'ab\\xff', 3)

These functions return the transformed bytes object, plus a number indicating how many bytes were processed... you can just ignore the latter.

>>> value = b's\\000u\\000p\\000p\\000o\\000r\\000t\\000@\\000p\\000s\\000i\\000l\\000o\\000c\\000.\\000c\\000o\\000m\\000'
>>> codecs.escape_decode(value)[0]
b's\x00u\x00p\x00p\x00o\x00r\x00t\x00@\x00p\x00s\x00i\x00l\x00o\x00c\x00.\x00c\x00o\x00m\x00'
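To get from those bytes to text, one more decode is needed. The sketch below assumes the payload is UTF-16-LE, as in the question; unescape_then_decode is a hypothetical helper name, and codecs.escape_decode is undocumented, so its behavior may change between Python versions.

```python
import codecs


def unescape_then_decode(raw: bytes, encoding: str = "utf-16-le") -> str:
    """Unescape a bytes object, then decode the result as text."""
    # escape_decode returns a tuple: (unescaped bytes, input bytes consumed).
    unescaped, _consumed = codecs.escape_decode(raw)
    return unescaped.decode(encoding)


print(unescape_then_decode(b"h\\000i\\000"))  # hi
```

Because escape_decode goes from bytes to bytes, no latin1 round trip is required here.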
Nathaniel J. Smith answered Oct 10 '22 10:10