In Python 3, suppose I have
>>> thai_string = 'สีเ'
Using encode gives
>>> thai_string.encode('utf-8')
b'\xe0\xb8\xaa\xe0\xb8\xb5\xe0\xb9\x80'
My question: how can I get encode() to return a bytes sequence that uses \u instead of \x? And how can I decode it back to a Python 3 str?
I tried using the ascii builtin, which gives
>>> ascii(thai_string)
"'\\u0e2a\\u0e35\\u0e40'"
But this doesn't seem quite right, as I can't decode it back to obtain thai_string.
The Python documentation tells me that \xhh escapes the character with the hex value hh, while \uxxxx escapes the character with the 16-bit hex value xxxx. The documentation says that \u is only used in string literals, but I'm not sure what that means. Is this a hint that my question has a flawed premise?
To escape a character in a string literal, add a backslash (\) before it.
In Python strings, the backslash “\” is a special character, also called the “escape” character. It is used in representing certain whitespace characters: “\t” is a tab, “\n” is a new line, and “\r” is a carriage return. Finally, “\” can be used to escape itself: “\\” is the literal backslash character.
The 'u' in front of a string literal marks it as a Unicode string. Unicode lets a string represent far more characters than ASCII can. In Python 3 every str is already Unicode, so the prefix is accepted but has no effect; it only matters in Python 2.
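As a rough illustration of how these escapes behave in Python 3 (the variable names are made up for this sketch): an escape sequence is only notation in the source code, and the resulting object holds the character or byte itself. That is what "only used in string literals" amounts to.

```python
# Escape sequences are source-code notation; the str holds the character itself.
tab = '\t'
backslash = '\\'
so_sua = '\u0e2a'   # the first character of the Thai string, written as a \u escape

print(len(tab), len(backslash), len(so_sua))   # each one is a single character
print(so_sua == 'ส')                            # escape and literal character are the same str

# In a str literal, \xhh names the code point U+00hh;
# in a bytes literal, \xhh names a single byte.
print('\xe0' == '\u00e0')
print(b'\xe0'[0] == 0xE0)

# A raw string keeps the backslash, so this is six characters, not one.
raw = r'\u0e2a'
print(len(raw))
```

So a bytes object never "contains \u escapes" the way a str literal does; it can only contain the ASCII characters backslash, u, and four hex digits.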
You can use unicode_escape:
>>> thai_string.encode('unicode_escape')
b'\\u0e2a\\u0e35\\u0e40'
Note that encode() will always return a byte string (bytes), and the unicode_escape encoding is intended to:
    Produce a string that is suitable as Unicode literal in Python source code
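To get back to a str, the same codec can be used in the other direction. Here is a minimal sketch of the round trip, assuming the bytes were produced by encode('unicode_escape') in the first place; the names escaped, restored, and quoted are just for illustration. It also shows one way to reverse the output of ascii(), which is a quoted literal rather than an encoding.

```python
import ast

thai_string = 'สีเ'

# str -> bytes: pure ASCII bytes containing literal backslash-u escapes
escaped = thai_string.encode('unicode_escape')
print(escaped)                      # b'\\u0e2a\\u0e35\\u0e40'

# bytes -> str: decoding with the same codec reverses the step above
restored = escaped.decode('unicode_escape')
print(restored == thai_string)

# ascii() returns a str that includes the surrounding quotes, so it has to be
# parsed as a Python literal rather than decoded
quoted = ascii(thai_string)
print(ast.literal_eval(quoted) == thai_string)
```

Decoding arbitrary bytes with unicode_escape is a different matter: non-escaped bytes are interpreted as Latin-1, so this is best treated as the inverse of encode('unicode_escape') and not as a general-purpose decoder.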