I have s = u'Gaga\xe2\x80\x99s'
but need to convert it to t = u'Gaga\u2019s'
How can this best be achieved?
When you call string.encode('utf-8'), you get the UTF-8 bytes, which are displayed in hex notation (\xNN) when you inspect them. Printing the string itself still shows the original Unicode text. If you want the hex notation, you can get it with the repr() function. To build it by hand, use the join function with '\x' as the separator, so that \x is inserted between the hex values of consecutive bytes.
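A minimal sketch of the hex-notation idea described above. The variable names are illustrative only, and `bytearray` is used so that iterating yields integers on both Python 2 and Python 3.

```python
text = u'Gaga\u2019s'
encoded = text.encode('utf-8')  # the UTF-8 bytes for the text

# repr() shows non-ASCII bytes as \xNN escapes.
print(repr(encoded))

# Build a hex string for every byte by hand: join the two-digit hex
# values with '\x' as the separator, and prefix the first one as well.
hex_values = ['{:02x}'.format(b) for b in bytearray(encoded)]
hex_notation = '\\x' + '\\x'.join(hex_values)
print(hex_notation)
```

Unlike repr(), the hand-built version also writes the ASCII bytes in hex.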
1. First, str in Python 3 holds Unicode text.
2. Second, UTF-8 is an encoding standard for turning a Unicode string into bytes. There are many encoding standards out there.
Unicode strings can be encoded into plain byte strings in whichever encoding you choose. A Python Unicode string is an abstract object big enough to hold any character, analogous to Python’s long integers. If the Unicode string contains only ASCII characters, you can use the str() function to convert it into a plain string (this matters in Python 2, where str is a byte string):
data = u"xyzw"
app = str(data)
print(app)
A good practice is to decode your bytes from UTF-8 (or whichever encoding was used to create them) as soon as they are loaded from a file. Run your processing on Unicode code points throughout your Python code, and encode back to bytes with UTF-8 only when writing to a file at the end.
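The decode-early, encode-late pattern can be sketched roughly as follows. The function name `process` and the upper-casing step are hypothetical stand-ins for real processing, and the bytes are kept in memory here rather than read from an actual file.

```python
def process(raw_bytes, encoding='utf-8'):
    """Decode at the input boundary, work on text, encode at the output."""
    text = raw_bytes.decode(encoding)   # bytes -> Unicode, as early as possible
    text = text.upper()                 # placeholder for the real processing
    return text.encode(encoding)        # Unicode -> bytes, as late as possible

raw = u'Gaga\u2019s'.encode('utf-8')    # stands in for bytes read from a file
result = process(raw)
print(repr(result))
```

Keeping all intermediate values as Unicode means the encoding only has to be right in two places: where bytes come in and where they go out.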
s = u'Gaga\xe2\x80\x99s'
t = u'Gaga\u2019s'
x = s.encode('raw-unicode-escape').decode('utf-8')
assert x==t
print(x)
yields
Gaga’s
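Wherever you decoded the original string, it was likely decoded with latin-1 or a close relative. Since latin-1 maps byte values 0-255 directly onto the first 256 code points of Unicode, encoding with latin-1 recovers the original bytes, which can then be decoded as UTF-8: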
>>> s = u'Gaga\xe2\x80\x99s'
>>> s.encode('latin-1').decode('utf8')
u'Gaga\u2019s'
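If the same repair has to be applied to strings that may or may not be damaged, the latin-1 round trip can be wrapped in a small helper. This is only a sketch: `fix_mojibake` and its `wrong_encoding` parameter are hypothetical names, and returning the input unchanged on failure is an assumption about the desired behaviour.

```python
def fix_mojibake(text, wrong_encoding='latin-1'):
    """Undo a UTF-8 string that was mistakenly decoded as wrong_encoding.

    Returns the input unchanged if the round trip is not possible.
    """
    try:
        return text.encode(wrong_encoding).decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside wrong_encoding, or the
        # recovered bytes are not valid UTF-8: assume it was not damaged.
        return text

fixed = fix_mojibake(u'Gaga\xe2\x80\x99s')
print(repr(fixed))

# If the text was mis-decoded as cp1252 instead, the same bytes show up
# as different characters, and that codec has to be named explicitly.
fixed_cp1252 = fix_mojibake(u'Gaga\u00e2\u20ac\u2122s', wrong_encoding='cp1252')
```

Note that a short string can survive the round trip by coincidence, so a helper like this is best applied only to data known to have been decoded with the wrong codec.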