UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?

I have to read a text file into Python. The file encoding is:

file -bi test.csv  text/plain; charset=us-ascii 

This is a third-party file, and I get a new one every day, so I would rather not change it. The file has non-ASCII characters, such as Ö, for example. I need to read the lines using Python, and I can afford to ignore a line which has a non-ASCII character.

My problem is that when I read the file in Python, I get the UnicodeDecodeError when reaching the line where a non-ascii character exists, and I cannot read the rest of the file.

Is there a way to avoid this? If I try this:

    import codecs

    fileHandle = codecs.open("test.csv", encoding='utf-8')
    try:
        for line in fileHandle:
            print(line, end="")
    except UnicodeDecodeError:
        pass

then when the error is reached the for loop ends and I cannot read the remainder of the file. I want to skip the line that causes the error and go on. I would rather not make any changes to the input file, if possible.

Is there any way to do this? Thank you very much.

Asked Jul 07 '14 by Chicoscience


1 Answer

Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.

You can tell open() how to treat decoding errors, with the errors keyword:

errors is an optional string that specifies how encoding and decoding errors are to be handled; it cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

  • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.

Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.
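For example, a minimal sketch of a non-strict handler in action (the file contents below are hypothetical sample data; the byte 0xD6 is "Ö" in Latin-1, but it is not a valid UTF-8 sequence on its own):

```python
import os
import tempfile

# Write hypothetical sample data: the second line contains the raw
# Latin-1 byte 0xD6 ("Ö"), which cannot be decoded as UTF-8.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"plain line\nb\xd6rse\n")

# errors="replace" substitutes U+FFFD for the bad byte instead of raising,
# so iteration continues past the offending line.
with open(path, encoding="utf-8", errors="replace") as f:
    lines = f.readlines()
os.remove(path)

print(lines)  # the bad byte now appears as the U+FFFD replacement marker
```

With errors="ignore" the bad byte would simply be dropped, which reads cleanly but silently loses data.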

Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:

    import re

    _surrogates = re.compile(r"[\uDC80-\uDCFF]")

    def detect_decoding_errors_line(l, _s=_surrogates.finditer):
        """Return decoding errors in a line of text

        Works with text lines decoded with the surrogateescape
        error handler.

        Returns a list of (pos, byte) tuples

        """
        # DC80 - DCFF encode bad bytes 80 - FF
        return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
                for m in _s(l)]

E.g.

    with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
        for i, line in enumerate(f, 1):
            errors = detect_decoding_errors_line(line)
            if errors:
                print(f"Found errors on line {i}:")
                for (col, b) in errors:
                    print(f" {col + 1:2d}: {b[0]:02x}")
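Since the original question wants to skip offending lines entirely, the same surrogate test can drive a filter. A self-contained sketch (the sample bytes here are a hypothetical stand-in for the real test.csv):

```python
import os
import re
import tempfile

# surrogateescape maps undecodable bytes 0x80-0xFF to U+DC80-U+DCFF,
# so any line containing a code point in that range failed to decode.
_surrogates = re.compile(r"[\uDC80-\uDCFF]")

# Hypothetical stand-in for test.csv: the middle line holds a Latin-1
# "Ö" (byte 0xD6), which is not valid UTF-8.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"good,1\nb\xd6rse,2\nalso good,3\n")

kept = []
with open(path, encoding="utf8", errors="surrogateescape") as f:
    for line in f:
        if _surrogates.search(line):
            continue  # this line contained a byte that failed to decode; skip it
        kept.append(line)
os.remove(path)

print(kept)  # only the two cleanly decoded lines remain
```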

Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 cannot cope with dropped or extra bytes, which then affects how accurately line separators can be located. The above approach can then result in the remainder of the file being treated as one long line, which for a big enough file can in turn lead to a MemoryError.

Answered Oct 11 '22 by Martijn Pieters