Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python 3 chokes on CP-1252/ANSI reading

I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:

  File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>

The files are opened with open() with no extra arguemnts. Can I pass extra arguments to open() or use something in the codec module to open these differently?

This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.

UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.

like image 737
Aaron Altman Avatar asked Jul 19 '10 20:07

Aaron Altman


2 Answers

Position 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:

>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>

or with an actual file:

>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte

Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:

>>> open('test.txt', encoding='latin-1').read()
'\x81\n'

Beware that there are differences between Windows-1257 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.

like image 117
Marius Gedminas Avatar answered Oct 03 '22 04:10

Marius Gedminas


You can relax the error handling.

For instance:

f = open(filename, encoding="...", errors="replace")

Or:

f = open(filename, encoding="...", errors="ignore")

See the docs.

EDIT:

But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails

like image 32
codeape Avatar answered Oct 03 '22 03:10

codeape