I have a string in UTF-8 format, but I'm not sure how to convert it to its corresponding character literal. For example, I have the string:
'Entre\xc3\xa9'
Example one:
This code:
u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')
returns the result: u'Entre\xe9'
If I then continue by printing this:
print u'Entre\xe9'
I get the result: Entreé
This is great and close to what I need. The problem is that I can't store 'Entre\xc3\xa9' in a variable and pass it through the same steps, because that breaks. Any tips for getting this working?
Example:
a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b
I would like the result of "c" to be:
Entreé
UTF-8 is a byte-oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
Popular encodings include UTF-8 and ASCII. Using the encode() method, you can convert unicode strings into any encoding supported by Python. In Python 3, encode() defaults to UTF-8; in Python 2, the default encoding is ASCII.
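As a rough illustration of "one or more bytes per character", the sketch below encodes three sample characters and counts the resulting bytes. The names samples and byte_lengths are made up for this example, and the u'' prefixes are used so the snippet should behave the same under Python 2.7 and Python 3.3+.

```python
# Three sample characters: 'e' (ASCII), 'é' (U+00E9), '€' (U+20AC).
samples = [u'e', u'\xe9', u'\u20ac']

# Encode each character to UTF-8 and count how many bytes it needs.
byte_lengths = [len(ch.encode('utf-8')) for ch in samples]

# 'é' in particular becomes the two bytes 0xC3 0xA9,
# which is exactly the '\xc3\xa9' seen in the question.
e_acute_bytes = u'\xe9'.encode('utf-8')

print(byte_lengths)
```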
The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.
You cannot make a unicode value from a byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Vice versa, you can encode unicode objects to byte strings with unicode.encode().
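A minimal sketch of that round trip, assuming the bytes really are UTF-8. The variable names raw, text and round_trip are illustrative only, and the explicit b'' prefix is there so the snippet should also run on Python 3, where a plain '...' literal is already text rather than bytes.

```python
# A byte string holding UTF-8 encoded data (7 bytes).
raw = b'Entre\xc3\xa9'

# Decoding turns the bytes into a unicode value (6 characters).
text = raw.decode('utf-8')

# Encoding goes the other way and reproduces the original bytes.
round_trip = text.encode('utf-8')

print(len(raw), len(text))
```

Note that the length drops from 7 to 6 after decoding, because the two bytes \xc3\xa9 collapse into the single character é.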
Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
Your a value is defined using a byte string literal, so you only need to decode:
a = 'Entre\xc3\xa9'
b = a.decode('utf8')
Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
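The two-step repair can be spelled out as below. This is only a sketch of why the encode/decode pair from the question works; the names mojibake, raw and fixed are invented for the example.

```python
# A unicode value whose code points are really UTF-8 byte values
# (U+00C3 and U+00A9 instead of the single character U+00E9).
mojibake = u'Entre\xc3\xa9'

# Latin-1 maps code points 0-255 straight to the same byte values,
# so encoding recovers the original UTF-8 bytes.
raw = mojibake.encode('latin-1')

# Decoding those bytes as UTF-8 yields the intended text.
fixed = raw.decode('utf-8')

print(fixed == u'Entre\xe9')
```

This only works when every code point in the damaged text is below 256; otherwise the Latin-1 encode step raises a UnicodeEncodeError.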
You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder