Parse file in robust way with python 3

Question

I have a log file that I need to go through line by line, and apparently it contains some "bad bytes". I get an error message along the following lines:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 9: invalid start byte

I have been able to strip down the problem to a file "log.test" containing the following line:

Message: \260

(At least this is how it shows up in my Emacs.)

I have a file "demo_error.py" which looks like this:

import sys
with open(sys.argv[1], 'r') as lf:
    for i, l in enumerate(lf):
        print(i, l.strip())

I then run, from the command line:

$ python3 demo_error.py log.test

The full traceback is:

Traceback (most recent call last):
  File "demo_error.py", line 5, in <module>
    for i, l in enumerate(lf):
  File "/usr/local/Cellar/python3/3.4.0/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 13: invalid start byte

My hunch is that I have to somehow specify a more general codec ("raw ascii" for instance) - but I'm not quite sure how to do this.

Note that this is not really a problem in Python 2.7.

And just to make my point clear: I don't mind getting an exception for the line in question - then I can simply discard the line. The problem is that the exception seems to happen on the "for" loop itself, which makes special handling of that particular line impossible.

skrrgwasme · Accepted Answer

You can also use the codecs module. When you use the codecs.open() function, you can specify how it handles errors using the errors argument:

codecs.open(filename, mode[, encoding[, errors[, buffering]]])

The errors argument is a string naming an error handler, which specifies how you want Python to behave when it attempts to decode bytes that are invalid for the chosen encoding. You'll probably be most interested in 'ignore' or 'replace' (implemented by codecs.ignore_errors and codecs.replace_errors), which cause invalid bytes to be either dropped or replaced with a placeholder character (U+FFFD when decoding), respectively.
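
To make the difference between the handlers concrete, here is a small sketch that decodes a byte string directly. It assumes the line in "log.test" is the text "Message: " followed by the single byte 0xb0 (octal \260, as Emacs displays it); the variable names are only for illustration.

```python
# Assumed content of the problem line: "Message: " plus the stray byte 0xb0.
raw = b"Message: \xb0"

ignored = raw.decode("utf-8", errors="ignore")    # drops the undecodable byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD for it

print(repr(ignored))
print(repr(replaced))

try:
    raw.decode("utf-8")  # errors="strict" is the default handler
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)
```

With 'ignore' the bad byte silently disappears, while 'replace' leaves a visible marker, which is usually easier to spot later in a log.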

This method can be a good alternative when you know you have corrupt data that will cause the UnicodeDecodeError to be raised even when you specify the correct encoding.
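
For what it's worth, the built-in open() in Python 3 takes the same encoding and errors arguments, so the same idea can be sketched without codecs. In the snippet below, read_lines_lossy and the temporary file are hypothetical names made up for illustration, and the file content is assumed to match the "log.test" line from the question.

```python
import os
import tempfile


def read_lines_lossy(path, encoding="utf-8", errors="replace"):
    """Return the stripped lines of a file; undecodable bytes become U+FFFD."""
    with open(path, "r", encoding=encoding, errors=errors) as lf:
        return [line.strip() for line in lf]


# Demo on a throwaway file (assumed stand-in for log.test).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Message: \xb0\n")
try:
    lines = read_lines_lossy(tmp.name)
    print(lines)
finally:
    os.remove(tmp.name)
```

Passing errors="ignore" instead would drop the bad byte rather than mark it.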

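Example using codecs.open():

import codecs

# An explicit encoding is needed: without one, codecs.open() returns a plain
# file object from the built-in open() and the errors argument has no effect.
with codecs.open('file.txt', mode='r', encoding='utf-8', errors='ignore') as lf:
    for i, l in enumerate(lf):
        # Invalid bytes are dropped, so iterating does not raise UnicodeDecodeError
        print(i, l.strip())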
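
If you would rather discard the offending line entirely, as the question asks, one possible approach is to open the file in binary mode and decode each line yourself, so the exception is raised inside the loop body rather than by the iteration. This is only a sketch: iter_good_lines is a hypothetical helper, the demo file is made up, and it assumes an encoding such as UTF-8 in which a newline is always the single byte 0x0a.

```python
import os
import tempfile


def iter_good_lines(path, encoding="utf-8"):
    """Yield (index, text) for lines that decode cleanly; skip the others."""
    with open(path, "rb") as lf:  # binary mode: iterating never decodes
        for i, raw in enumerate(lf):
            try:
                yield i, raw.decode(encoding).strip()
            except UnicodeDecodeError:
                continue  # discard only this line and keep going


# Demo on a throwaway file whose middle line contains the bad byte 0xb0.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\nMessage: \xb0\nlast\n")
try:
    for i, line in iter_good_lines(tmp.name):
        print(i, line)
finally:
    os.remove(tmp.name)
```

The indices still count the skipped line, so in the demo only lines 0 and 2 are printed.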

Tags:

python

python-3.x

ukrutt
