My code fetches some content from a UserVoice site. As you might know, UserVoice doesn't handle its data very carefully: to reduce the amount of text on the search page, they cut the text at, say, 300 characters and append "..." to the end. The problem is that they don't mind cutting in the middle of a multi-byte character, which leaves a partial UTF-8 sequence behind: e.g. for the è
char, I got \xc3
instead of \xc3\xa8
.
Of course, when I feed this horrible soup to json.loads
, it fails with a UnicodeDecodeError
. So my question is simple: how can I ask json.loads
to ignore these bad bytes, as I would by using .decode('utf-8', 'ignore')
if I had access to the internals of the function?
Thanks.
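For illustration, here is a minimal sketch (Python 3 here, and the sample word is invented) of how cutting a UTF-8 byte string mid-character produces exactly this kind of dangling lead byte:

```python
# Truncating a UTF-8 byte string in the middle of a multi-byte
# character leaves a dangling lead byte behind.
text = "qualité"            # 'é' encodes as two bytes: 0xc3 0xa9
data = text.encode("utf-8")
truncated = data[:-1]        # cut off the final continuation byte
print(truncated)             # b'qualit\xc3'

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)               # invalid/unexpected end of data
```

Decoding the truncated bytes strictly fails, which is what `json.loads` runs into internally.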
You can't ask simplejson to ignore them. When I hit a similar problem, I just ran .decode('utf-8', 'ignore').encode('utf-8')
on the raw bytes first and proceeded from there.
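A minimal sketch of that round-trip on Python 3, where the cleaned text can go straight to json.loads (the sample payload is invented):

```python
import json

# Raw bytes as scraped from the page, with a dangling UTF-8 lead byte.
raw = b'["qualit\xc3.."]'

# 'ignore' silently drops the undecodable byte.
clean = raw.decode("utf-8", "ignore")

print(json.loads(clean))   # ['qualit..']
```

On Python 2 you would re-encode afterwards, as above, if the caller expects a byte string.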
Just pass a Unicode string to json.loads()
(the session below is Python 2):
>>> badstr = "qualité"[:-1]+".."
>>> badstr
'qualit\xc3..'
>>> json_str = '["%s"]' % badstr
>>> import json
>>> json.loads(json_str)
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 6: invalid \
continuation byte
>>> json.loads(json_str.decode('utf-8','ignore'))
[u'qualit..']