Unicode latin1 string encode / decode

While copying data from an unknown, old, and inconsistent MySQL database into a PostgreSQL UTF-8 database through the Python (Django) ORM, I sometimes end up with incorrectly encoded data.

Target: grégory

> a
u'gr\xe3\xa9gory'

> print a
grã©gory

I tried several decode/encode tricks without success:

 > print a.encode('utf-8').decode('latin1')
 grã©gory

 > print a.decode('latin-1')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

Attempts with the unicode_escape codec did not help either.

coulix asked Dec 22 '25 02:12


1 Answer

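I guess the string was incorrectly converted to lowercase at some point, which changed \xc3 to \xe3. The lowercase conversion assumed the bytes were latin1 when they were actually utf-8: in latin1, \xc3 is Ã and its lowercase form is ã (\xe3), whereas in utf-8 the pair \xc3\xa9 is the two-byte encoding of é.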
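The suspected corruption can be reproduced step by step. This is only a minimal sketch: it assumes the data went through a latin1 misread followed by a lowercase conversion somewhere upstream (the real pipeline is unknown), and the variable names are illustrative. It is written to run on both Python 2.7 and Python 3.3+.

```python
original = u"gr\u00e9gory"               # the target text (e with acute accent)
utf8_bytes = original.encode("utf-8")    # bytes: g r 0xC3 0xA9 g o r y

# Assumed step 1: the UTF-8 bytes are misread as Latin-1, so the two bytes
# of the accented letter become two characters, U+00C3 and U+00A9.
misread = utf8_bytes.decode("latin-1")

# Assumed step 2: lowercasing maps U+00C3 to U+00E3; U+00A9 has no case
# and stays as it is.
lowered = misread.lower()
print(repr(lowered))

# The damaged bytes are no longer valid UTF-8, so a plain decode fails.
damaged_bytes = lowered.encode("latin-1")
try:
    damaged_bytes.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False
```

Here `lowered` is equal to the value `u'gr\xe3\xa9gory'` shown in the question, which is why none of the plain encode/decode round trips can bring the original text back.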

>>> print 'gr\xc3\xa9gory'.decode('utf8')
grégory
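
If the guess above is right, the damage can sometimes be undone by shifting the affected lead bytes back before decoding. The helper below is a hypothetical sketch, not part of Django or any library, and it is a heuristic: it assumes the only damage is a latin1 lowercase applied to utf-8 bytes, and it only touches bytes that look like a lowercased two-byte utf-8 lead followed by exactly one continuation byte.

```python
import re

# A byte that could be a lowercased 2-byte UTF-8 lead (0xC2-0xDE except 0xD7,
# shifted up by 0x20), followed by exactly one continuation byte.
_SUSPECT = re.compile(b"([\xe2-\xf6\xf8-\xfe])([\x80-\xbf])(?![\x80-\xbf])")


def undo_latin1_lowercase(text):
    """Hypothetical helper: try to recover text whose UTF-8 bytes were
    lowercased as if they were Latin-1. Returns the input unchanged when
    the repair does not produce valid UTF-8."""
    try:
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        return text  # characters outside Latin-1: not this kind of damage

    def restore(match):
        lead = bytearray(match.group(1))
        lead[0] -= 0x20  # e.g. 0xE3 -> 0xC3, undoing the Latin-1 lowercase
        return bytes(lead) + match.group(2)

    candidate = _SUSPECT.sub(restore, raw)
    try:
        return candidate.decode("utf-8")
    except UnicodeDecodeError:
        return text  # still not valid UTF-8: leave the value untouched
```

For the value in the question, `undo_latin1_lowercase(u'gr\xe3\xa9gory')` gives back grégory. The heuristic cannot tell a damaged pair from text that genuinely contains ã followed by ©, and it cannot restore ASCII letters that were lowercased along the way, so it is safer to review the repaired values than to apply it blindly to a whole table.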
Janne Karila answered Dec 23 '25 22:12