
Dealing with wacky encodings in Python

I have a Python script that pulls in data from many sources (databases, files, etc.). Supposedly, all the strings are unicode, but what I end up getting is any variation on the following theme (as returned by repr()):

u'D\\xc3\\xa9cor'
u'D\xc3\xa9cor'
'D\\xc3\\xa9cor'
'D\xc3\xa9cor'

Is there a reliable way to take any of the four strings above and return the proper unicode string?

u'D\xe9cor' # --> Décor

The only way I can think of right now uses eval(), replace(), and a deep, burning shame that will never wash away.

Tyson asked Jun 30 '26 16:06


2 Answers

That's just UTF-8 encoded data. Use .decode('utf-8') to convert the byte string into unicode.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'

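For the 'D\\xc3\\xa9cor' case, where the backslashes are literal characters, perform an additional string-escape decode first to turn the escape sequences into real bytes. A string without escape sequences passes through that step unchanged.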

>>> 'D\xc3\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> u'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'

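The 2nd case, u'D\xc3\xa9cor', still fails: calling .decode on a unicode object first encodes it implicitly as ASCII, which raises UnicodeEncodeError on the non-ASCII characters. To handle it as well, you need to detect if the input is unicode and convert it into a str first. Encoding as ISO-8859-1 maps each code point up to U+00FF back to the byte with the same value.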

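>>> def conv(s):
...     # unicode input: map each code point (<= U+00FF) back to the same-valued byte
...     if isinstance(s, unicode):
...         s = s.encode('iso-8859-1')
...     # turn literal \xNN escapes into real bytes, then decode those bytes as UTF-8
...     return s.decode('string-escape').decode('utf-8')
... 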
>>> map(conv, [u'D\\xc3\\xa9cor', u'D\xc3\xa9cor', 'D\\xc3\\xa9cor', 'D\xc3\xa9cor'])
[u'D\xe9cor', u'D\xe9cor', u'D\xe9cor', u'D\xe9cor']
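
The session above is Python 2. On Python 3, str has no .decode and the string-escape codec is gone, so a rough equivalent has to look a little different. What follows is only a minimal sketch under stated assumptions, not part of the original answer: it assumes Python 3, reuses the name conv purely for illustration, and swaps in the unicode_escape codec (which treats non-escape bytes as Latin-1) plus a Latin-1 round trip in place of string-escape.

```python
def conv(s):
    """Sketch (Python 3 assumed): repair UTF-8 text that arrives as str or
    bytes, with or without literal backslash escapes such as \\xc3."""
    # str input: map each code point (<= U+00FF) back to the same-valued byte.
    if isinstance(s, str):
        s = s.encode('latin-1')
    # Resolve literal \xNN escapes; raw bytes are read as Latin-1 here,
    # so every original byte ends up as one code point of the same value.
    text = s.decode('unicode_escape')
    # Recover the raw bytes, then decode them as the UTF-8 they really are.
    return text.encode('latin-1').decode('utf-8')


samples = ['D\\xc3\\xa9cor', 'D\xc3\xa9cor', b'D\\xc3\\xa9cor', b'D\xc3\xa9cor']
print([conv(s) for s in samples])
# ['Décor', 'Décor', 'Décor', 'Décor']
```

Like the original, this only suits input known to be mangled in this way. A string that is already correct (for example 'D\xe9cor') or that contains code points above U+00FF will raise an exception rather than pass through.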

kennytm answered Jul 02 '26 05:07


Write an adapter for each source that knows which transformations that source's data needs, rather than guessing from the data itself.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
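
One way to read "adapters" is as a lookup from source to decoding function, so each source's quirks are written down once. This is a minimal sketch, assuming Python 3. The source names ('clean_feed', 'escaped_dump') and the helper read_text are hypothetical, and unicode_escape stands in for Python 2's string-escape.

```python
def decode_utf8(raw: bytes) -> str:
    # Source delivers plain UTF-8 bytes.
    return raw.decode('utf-8')


def decode_escaped_utf8(raw: bytes) -> str:
    # Source delivers UTF-8 bytes written out as literal \xNN escapes.
    # unicode_escape resolves the escapes to code points <= U+00FF;
    # the Latin-1 round trip turns those back into bytes for UTF-8 decoding.
    return raw.decode('unicode_escape').encode('latin-1').decode('utf-8')


# Hypothetical source names; each maps to the transformation it is known to need.
ADAPTERS = {
    'clean_feed': decode_utf8,
    'escaped_dump': decode_escaped_utf8,
}


def read_text(source: str, raw: bytes) -> str:
    """Decode raw data using the adapter registered for its source."""
    return ADAPTERS[source](raw)


print(read_text('clean_feed', b'D\xc3\xa9cor'))      # Décor
print(read_text('escaped_dump', b'D\\xc3\\xa9cor'))  # Décor
```

An unknown source raises KeyError here, which is arguably the point: new inputs get an explicit decision about their encoding instead of a guess.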

Ignacio Vazquez-Abrams answered Jul 02 '26 05:07