
Python Convert Unicode-Hex utf-8 strings to Unicode strings

I have s = u'Gaga\xe2\x80\x99s' but need to convert it to t = u'Gaga\u2019s'.

How can this be best achieved?

Henry Thornton asked Sep 30 '11 11:09




2 Answers

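s = u'Gaga\xe2\x80\x99s'
t = u'Gaga\u2019s'
# raw-unicode-escape turns each code point below 256 back into a single byte;
# decoding those bytes as UTF-8 then yields the intended character.
x = s.encode('raw-unicode-escape').decode('utf-8')
assert x == t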

print(x)

yields

Gaga’s
unutbu answered Nov 14 '22 23:11
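As a rough illustration of why the raw-unicode-escape round trip works, the sketch below compares it with latin-1 on the same input. It assumes Python 3, where str is always Unicode; the variable names raw, latin and mixed are illustrative and not from the answer.

```python
# Assumes Python 3; the u'' prefix from the question is optional there.
s = 'Gaga\xe2\x80\x99s'

# For code points below 256, both codecs emit exactly one byte per character.
raw = s.encode('raw-unicode-escape')
latin = s.encode('latin-1')
print(raw == latin)                  # True
print(raw)                           # b'Gaga\xe2\x80\x99s'
print(ascii(raw.decode('utf-8')))    # 'Gaga\u2019s'

# The codecs differ once a character above U+00FF is present:
# raw-unicode-escape writes a literal \uXXXX escape, latin-1 raises.
mixed = 'Gaga\u2019s'
print(mixed.encode('raw-unicode-escape'))    # b'Gaga\\u2019s'
try:
    mixed.encode('latin-1')
except UnicodeEncodeError as exc:
    print('latin-1 refused:', exc.reason)
```

So for a string like s, whose code points are all below 256, the two encodings are interchangeable; they only behave differently on text that is already partly correct.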


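Wherever you decoded the original string, it was likely decoded with latin-1 or a close relative. Since latin-1 maps each byte value one-to-one onto the first 256 codepoints of Unicode, encoding back to latin-1 recovers the original bytes, which can then be decoded as UTF-8: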

>>> s = u'Gaga\xe2\x80\x99s'
>>> s.encode('latin-1').decode('utf8')
u'Gaga\u2019s'
Mark Tolonen answered Nov 14 '22 23:11
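If the same repair is needed in several places, the encode/decode round trip could be wrapped in a small function. This is only a sketch, assuming Python 3; repair_mojibake and its parameters are hypothetical names, not part of either answer. It leaves the text untouched when the round trip fails, which usually means the string was not mis-decoded in this way.

```python
def repair_mojibake(text, wrong_codec='latin-1', right_codec='utf-8'):
    """Hypothetical helper: undo a decode that used the wrong codec.

    Returns the text unchanged if it cannot be re-encoded with
    wrong_codec or the resulting bytes are not valid right_codec.
    """
    try:
        return text.encode(wrong_codec).decode(right_codec)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


fixed = repair_mojibake('Gaga\xe2\x80\x99s')
print(ascii(fixed))                     # 'Gaga\u2019s'

# Already-correct text is left alone (U+2019 cannot be encoded as latin-1).
print(ascii(repair_mojibake(fixed)))    # 'Gaga\u2019s'

# A "close relative" case: the same bytes mis-decoded as cp1252 instead.
cp1252_garbled = 'Gaga\xe2\u20ac\u2122s'
print(ascii(repair_mojibake(cp1252_garbled, wrong_codec='cp1252')))  # 'Gaga\u2019s'
```

Fixing the decoding at the point where the bytes are first read is still the cleaner option; a helper like this only patches strings that have already been damaged.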