
How should I decode bytes (using ASCII) without losing any "junk" bytes if xmlcharrefreplace and backslashreplace don't work?

I have a network resource which returns data that should (according to the specs) be an ASCII-encoded string. But on rare occasions, I get junk data.

One resource, for example, returns b'\xd3PS-90AC', whereas another resource returns b'PS-90AC' for the same key.

The first value contains a non-ASCII byte. That is clearly a violation of the spec, but unfortunately it is out of my control. None of us is 100% certain whether this really is junk or data which should be kept.

The application calling the remote resources saves the data in a local database for daily use. I could simply do data.decode('ascii', 'replace') or data.decode('ascii', 'ignore'), but then I would lose data which could turn out to be useful later on.

My immediate reflex was to use 'xmlcharrefreplace' or 'backslashreplace' as the error handler, simply because it would result in a displayable string. But then I get the following error: TypeError: don't know how to handle UnicodeDecodeError in error callback

The only error handler which worked was 'surrogateescape', but this seems to be intended for filenames. On the other hand, it would work for my purposes.
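To illustrate why 'surrogateescape' does not lose data, here is a minimal sketch using the byte string from the question. The variable names are only illustrative. One caveat worth checking against your database driver: the resulting string contains a lone surrogate, which a strict UTF-8 encoder will refuse.

```python
data = b'\xd3PS-90AC'

# The undecodable byte 0xD3 is mapped to the lone surrogate U+DCD3.
text = data.decode('ascii', 'surrogateescape')

# Encoding with the same handler turns the surrogate back into the byte.
restored = text.encode('ascii', 'surrogateescape')

# A strict UTF-8 encode rejects the lone surrogate.
try:
    text.encode('utf-8')
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
```

So the round trip is lossless, but the intermediate string may need to be encoded with 'surrogateescape' again before it is written anywhere that expects valid UTF-8.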

Why are 'xmlcharrefreplace' and 'backslashreplace' not working? I don't understand the error.

For example, the behaviour I expected would be:

>>> data = b'\xd3PS-90AC'
>>> new_data = data.decode('ascii', 'xmlcharrefreplace')
>>> print(repr(new_data))
'&#d3;PS-90AC'

This is a contrived example. My aim is to not lose any data. If I used the 'ignore' or 'replace' error handler, the byte in question would essentially disappear, and information would be lost.

exhuma asked Aug 22 '14 08:08



1 Answer

For completeness: as of Python 3.5, the 'backslashreplace' error handler also works for decoding, so you no longer have to register a custom error handler.
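A short sketch of both approaches, assuming Python 3.5 or later for the built-in handler. The handler name keep_bytes_as_escapes is hypothetical and only shows what a custom handler registered through codecs.register_error could look like on older versions.

```python
import codecs


def keep_bytes_as_escapes(error):
    """Hypothetical custom handler: render undecodable bytes as \\xNN."""
    # Only decoding errors are handled here; anything else is re-raised.
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad_bytes = error.object[error.start:error.end]
    replacement = ''.join('\\x{:02x}'.format(b) for b in bad_bytes)
    # Return the replacement text and the position at which to resume.
    return replacement, error.end


codecs.register_error('keep_bytes_as_escapes', keep_bytes_as_escapes)

data = b'\xd3PS-90AC'

# Built-in handler (works for decoding on Python 3.5+).
builtin_result = data.decode('ascii', 'backslashreplace')

# Custom handler, registered above under an illustrative name.
custom_result = data.decode('ascii', 'keep_bytes_as_escapes')

print(repr(builtin_result))
```

Both calls keep the offending byte visible as the four characters \xd3 in front of PS-90AC. Unlike 'surrogateescape', this is not automatically reversible, since a literal backslash sequence in the input would look the same.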

theamk answered Sep 29 '22 07:09
