
Parse a file in a robust way with Python 3

I have a log file that I need to go through line by line, and apparently it contains some "bad bytes". I get an error message along the following lines:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 9: invalid start byte

I have been able to strip down the problem to a file "log.test" containing the following line:

Message: \260

(At least this is how it shows up in my Emacs.)

I have a file "demo_error.py" which looks like this:

import sys
with open(sys.argv[1], 'r') as lf:
    for i, l in enumerate(lf):
        print(i, l.strip())

I then run, from the command line:

$ python3 demo_error.py log.test

The full traceback is:

Traceback (most recent call last):
  File "demo_error.py", line 5, in <module>
    for i, l in enumerate(lf):
  File "/usr/local/Cellar/python3/3.4.0/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 13: invalid start byte

My hunch is that I have to somehow specify a more general codec ("raw ascii" for instance) - but I'm not quite sure how to do this.

Note that this is not really a problem in Python 2.7.

And just to make my point clear: I don't mind getting an exception for the line in question - then I can simply discard the line. The problem is that the exception seems to happen on the "for" loop itself, which makes special handling of that particular line impossible.

ukrutt asked Sep 01 '25 05:09


1 Answer

You can also use the codecs module. When you use the codecs.open() function, you can specify how it handles errors using the errors argument:

codecs.open(filename, mode[, encoding[, errors[, buffering]]])

The errors argument is a string naming the error handler that Python applies when it meets bytes that are invalid for the chosen encoding. You'll probably be most interested in 'ignore' or 'replace', which cause the invalid bytes to be either dropped or replaced with a placeholder character (U+FFFD when decoding), respectively. The functions codecs.ignore_errors and codecs.replace_errors implement these handlers, but the value you pass is the string name.
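
To make the difference between the handlers concrete, here is a minimal sketch using bytes.decode(), which accepts the same errors values. The sample bytes mirror the "log.test" line from the question; the variable names are only illustrative.

```python
# The bytes of the sample line; 0xb0 cannot start a UTF-8 sequence.
raw = b"Message: \xb0"

# 'strict' is the default handler and raises on the bad byte.
try:
    raw.decode("utf-8", errors="strict")
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = "failed at byte offset {}".format(exc.start)

# 'ignore' drops the bad byte; 'replace' substitutes U+FFFD for it.
ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")

print(strict_result)
print(ascii(ignored))
print(ascii(replaced))
```

Note that 'ignore' loses data silently, while 'replace' leaves a visible marker where the bad byte was, which is usually easier to debug in a log file.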

This method can be a good alternative when you know you have corrupt data that will cause the UnicodeDecodeError to be raised even when you specify the correct encoding.

Example:

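import codecs

# Pass encoding explicitly: without it, codecs.open() falls back to the
# built-in open() and the errors argument is not applied.
with codecs.open('file.txt', mode='r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        # Bytes that are invalid UTF-8 are dropped, so corrupt data no
        # longer raises UnicodeDecodeError inside the loop
        print(line.strip())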
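
The question asks for a way to discard only the offending line, and Python 3's built-in open() also accepts encoding and errors. The sketch below shows both ideas under stated assumptions: the helper names (iter_valid_lines, iter_lenient_lines) and the temporary sample file are hypothetical, and UTF-8 is assumed as the intended encoding.

```python
import os
import tempfile


def iter_valid_lines(path, encoding="utf-8"):
    """Yield (line_number, text) for each line that decodes cleanly.

    The file is opened in binary mode, so the for loop itself never
    decodes anything; each line is decoded separately and skipped if
    decoding fails.
    """
    with open(path, "rb") as lf:
        for i, raw in enumerate(lf):
            try:
                yield i, raw.decode(encoding).strip()
            except UnicodeDecodeError:
                continue  # discard the undecodable line


def iter_lenient_lines(path, encoding="utf-8"):
    """Yield (line_number, text), replacing bad bytes with U+FFFD."""
    with open(path, "r", encoding=encoding, errors="replace") as lf:
        for i, line in enumerate(lf):
            yield i, line.strip()


# Build a small sample file whose second line contains the bad byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first line\nMessage: \xb0\nlast line\n")
    sample_path = tmp.name

valid = list(iter_valid_lines(sample_path))
lenient = list(iter_lenient_lines(sample_path))
os.remove(sample_path)

print(valid)
print([(i, ascii(text)) for i, text in lenient])
```

Reading in binary mode moves the decoding out of the for statement and into the loop body, which is what makes per-line exception handling possible. The second helper keeps every line but marks the damage instead.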

skrrgwasme answered Sep 02 '25 17:09