 

Python: handle broken unicode bytes when parsing JSON string

My code gets some content from a UserVoice site. As you might know, UserVoice doesn't handle this data correctly: to reduce the amount of text on the search page, it cuts the text at, say, 300 characters and then appends "..." to the end. The thing is, it doesn't mind cutting in the middle of a multi-byte character, leaving a partial UTF-8 sequence: e.g. for the è character, I get \xc3 instead of \xc3\xa8.

Of course, when I feed this horrible soup to json.loads, it fails with UnicodeDecodeError. So my question is simple: how can I tell json.loads to ignore these bad bytes, as I would with .decode('utf-8', 'ignore') if I had access to the internals of the function?


asked Nov 02 '11 by zopieux

2 Answers

You can't ask json (or simplejson) to ignore them. When I hit a similar problem, I simply ran .decode('utf-8', 'ignore').encode('utf-8') on the raw string before parsing and proceeded.
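A minimal sketch of that round trip, using a hypothetical truncated payload (Python 3 syntax, where the raw input is bytes; the original answer targets Python 2 str):

```python
import json

# Truncated UTF-8: the lone lead byte \xc3 is left over from a
# two-byte character that was cut in half.
raw = b'["qualit\xc3.."]'

# Dropping the invalid byte makes the payload valid UTF-8 again,
# at the cost of silently losing the damaged character.
cleaned = raw.decode('utf-8', 'ignore')
data = json.loads(cleaned)
print(data)  # ['qualit..']
```

The re-encode step in the answer only matters if downstream code expects a byte string again; for json.loads alone, decoding is enough.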

answered Nov 11 '22 by Lachezar


Just pass a Unicode string to json.loads():

>>> badstr = "qualité"[:-1]+".."
>>> badstr
'qualit\xc3..'
>>> json_str = '["%s"]' % badstr
>>> import json
>>> json.loads(json_str)
Traceback (most recent call last):
 ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 6: invalid continuation byte
>>> json.loads(json_str.decode('utf-8','ignore'))
[u'qualit..']
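The session above is Python 2. For reference, a sketch of the same fix under Python 3 assumptions, where the truncated payload arrives as bytes; using errors='replace' instead of 'ignore' keeps a visible U+FFFD marker where the damaged character was:

```python
import json

# Build a truncated payload: encode, chop off the trailing \xa9"]
# (splitting the two-byte é in half), then close the JSON again.
bad = '["qualité"]'.encode('utf-8')[:-3] + b'.."]'
# bad == b'["qualit\xc3.."]'

# 'replace' substitutes U+FFFD for the orphaned lead byte rather
# than dropping it silently.
print(json.loads(bad.decode('utf-8', 'replace')))  # ['qualit\ufffd..']
```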
answered Nov 11 '22 by jfs