I have some escaped strings that need to be unescaped. I'd like to do this in Python.
For example, in Python 2.7 I can do this:
>>> "\\123omething special".decode('string-escape')
'Something special'
How do I do it in Python 3? This doesn't work:
>>> b"\\123omething special".decode('string-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: string-escape
My goal is to be able to take a string like this:
s\000u\000p\000p\000o\000r\000t\000@\000p\000s\000i\000l\000o\000c\000.\000c\000o\000m\000
And turn it into:
"support@psiloc.com"
After I do the conversion, I'll probe to see if the string I have is encoded in UTF-8 or UTF-16.
You'll have to use unicode_escape instead:
>>> b"\\123omething special".decode('unicode_escape')
'Something special'
If you start with a str object instead (equivalent to the Python 2.7 unicode type), you'll need to encode to bytes first, then decode with unicode_escape.
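A minimal sketch of that str round trip, assuming the text is ASCII-only so the encode step is lossless. The helper name unescape_str is hypothetical, not part of the standard library:

```python
def unescape_str(text: str) -> str:
    """Interpret backslash escapes in an ASCII-only str (hypothetical helper)."""
    # str -> bytes; ASCII is an assumption here, and non-ASCII input
    # raises UnicodeEncodeError rather than being silently mangled.
    raw = text.encode('ascii')
    # bytes -> str, processing escapes such as \123 (octal) or \n.
    return raw.decode('unicode_escape')


result = unescape_str("\\123omething special")
print(result)
```

If the input may contain non-ASCII characters, this approach needs more care, because unicode_escape treats the bytes it is given as Latin-1.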
If you need bytes as the end result, you'll have to encode again to a suitable encoding (.encode('latin1'), for example, if you need to preserve literal byte values; the first 256 Unicode code points map one-to-one onto byte values).
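As a rough illustration of getting bytes back out, here is a sketch using a made-up input value (escaped is an assumption, not data from the question):

```python
# Hypothetical input: two literal characters plus two escaped bytes.
escaped = b"ab\\xff\\000"

# Step 1: process the escapes; the result is a str whose code points
# are all below 256 for this input.
text = escaped.decode('unicode_escape')

# Step 2: map each code point 0-255 back to the byte with the same value.
raw = text.encode('latin1')

print(raw)
```

The latin1 step only works while every code point is below 256; an escape such as \u0100 in the input would make it raise UnicodeEncodeError.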
Your example is actually UTF-16 data with escapes. Decode from unicode_escape, encode back to latin1 to preserve the bytes, then decode from utf-16-le (UTF-16 little endian without a BOM):
>>> value = b's\\000u\\000p\\000p\\000o\\000r\\000t\\000@\\000p\\000s\\000i\\000l\\000o\\000c\\000.\\000c\\000o\\000m\\000'
>>> value.decode('unicode_escape').encode('latin1')  # convert to bytes
b's\x00u\x00p\x00p\x00o\x00r\x00t\x00@\x00p\x00s\x00i\x00l\x00o\x00c\x00.\x00c\x00o\x00m\x00'
>>> _.decode('utf-16-le')  # decode from UTF-16-LE
'support@psiloc.com'
The old "string-escape" codec maps bytestrings to bytestrings, and there's been a lot of debate about what to do with such codecs, so it isn't currently available through the standard encode/decode interfaces.
BUT, the code is still there in the C-API (as PyBytes_En/DecodeEscape), and this is still exposed to Python via the undocumented codecs.escape_encode and codecs.escape_decode.
>>> import codecs
>>> codecs.escape_decode(b"ab\\xff")
(b'ab\xff', 6)
>>> codecs.escape_encode(b"ab\xff")
(b'ab\\xff', 3)
These functions return the transformed bytes object, plus a number indicating how many bytes were processed; you can just ignore the latter.
>>> value = b's\\000u\\000p\\000p\\000o\\000r\\000t\\000@\\000p\\000s\\000i\\000l\\000o\\000c\\000.\\000c\\000o\\000m\\000'
>>> codecs.escape_decode(value)[0]
b's\x00u\x00p\x00p\x00o\x00r\x00t\x00@\x00p\x00s\x00i\x00l\x00o\x00c\x00.\x00c\x00o\x00m\x00'
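To get from those raw bytes to readable text, one possible sketch follows. The guess_and_decode helper is hypothetical, and its NUL-byte check is only a crude assumption about how to tell UTF-16-LE from UTF-8, not a reliable detector:

```python
import codecs


def guess_and_decode(raw: bytes) -> str:
    """Hypothetical probe: treat data containing NUL bytes as UTF-16-LE."""
    # Assumption: ASCII-range text in UTF-16-LE has a NUL after every
    # character, while ordinary UTF-8 text contains no NUL bytes.
    if b'\x00' in raw:
        return raw.decode('utf-16-le')
    return raw.decode('utf-8')


value = (b's\\000u\\000p\\000p\\000o\\000r\\000t\\000@\\000p\\000s\\000i'
         b'\\000l\\000o\\000c\\000.\\000c\\000o\\000m\\000')

# escape_decode returns (bytes, consumed_length); only the bytes are needed.
raw, _consumed = codecs.escape_decode(value)
email = guess_and_decode(raw)
print(email)
```

Because codecs.escape_decode is undocumented, it may be safer to isolate the call in one place in case its behaviour changes between Python versions.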