 

UnicodeDecodeError: unexpected end of data

I have a huge text file which I want to open.
I'm reading the file in chunks to avoid memory issues from reading too much of it at once.

Code snippet:

import re

def open_delimited(fileName, args):
    with open(fileName, args, encoding="UTF16") as infile:
        chunksize = 10000
        remainder = ''
        # read the file in fixed-size chunks until read() returns ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk)
            # yield complete matches; keep the last one around in case it
            # straddles the chunk boundary
            for piece in pieces[:-1]:
                yield piece
            remainder = '{} {} '.format(*pieces[-1])
        if remainder:
            yield remainder

The code throws the error UnicodeDecodeError: 'utf16' codec can't decode bytes in position 8190-8191: unexpected end of data.

I tried UTF8 and got the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte.

With latin-1 and iso-8859-1 the code instead raises IndexError: list index out of range.

A sample of the input file:

b'\xff\xfe1\x000\x000\x005\x009\x00\t\x001\x000\x000\x005\x009\x00_\x009\x007\x004\x007\x001\x007\x005\x003\x001\x000\x009\x001\x00\t\x00\t\x00P\x00o\x00s\x00t\x00\t\x001\x00\t\x00H\x00a\x00p\x00p\x00y\x00 \x00B\x00i\x00r\x00t\x00h\x00d\x00a\x00y\x00\t\x002\x000\x001\x001\x00-\x000\x008\x00-\x002\x004\x00 \x00'
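For what it's worth, the sample above decodes cleanly on its own with UTF-16; the leading \xff\xfe is the little-endian BOM, which is also why UTF-8 complains about byte 0xff in position 0:

sample = b'\xff\xfe1\x000\x000\x005\x009\x00\t\x001\x000\x000\x005\x009\x00_\x009\x007\x004\x007\x001\x007\x005\x003\x001\x000\x009\x001\x00\t\x00\t\x00P\x00o\x00s\x00t\x00\t\x001\x00\t\x00H\x00a\x00p\x00p\x00y\x00 \x00B\x00i\x00r\x00t\x00h\x00d\x00a\x00y\x00\t\x002\x000\x001\x001\x00-\x000\x008\x00-\x002\x004\x00 \x00'
print(repr(sample.decode("utf16")))
# '10059\t10059_974717531091\t\tPost\t1\tHappy Birthday\t2011-08-24 '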

I will also mention that I have several of these huge text files. UTF16 works fine for many of them but fails on one specific file.

Is there any way to resolve this issue?

asked Aug 21 '13 by Presen
1 Answer

To ignore corrupted data (which can lead to data loss), set errors='ignore' on the open() call:

with open(fileName, args, encoding="UTF16", errors='ignore') as infile:
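Applied to the generator from the question (a sketch; nothing else changed):

import re

def open_delimited(fileName, args):
    # errors='ignore' silently drops byte sequences that cannot be decoded,
    # instead of raising UnicodeDecodeError
    with open(fileName, args, encoding="UTF16", errors='ignore') as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk)
            for piece in pieces[:-1]:
                yield piece
            remainder = '{} {} '.format(*pieces[-1])
        if remainder:
            yield remainder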

The open() function documentation states:

  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.

This does not mean you can recover from the apparent data corruption you are experiencing.

To illustrate, imagine a byte was dropped or added somewhere in your file. UTF-16 is a codec that uses 2 bytes per character. If there is one byte missing or surplus then all byte-pairs following the missing or extra byte are going to be out of alignment.

That can lead to problems decoding further down the line, not necessarily immediately. There are some code points in UTF-16 that are illegal on their own, usually because they are only valid in combination with another byte pair; your exception was thrown for such an invalid code point. But there may have been hundreds or thousands of byte pairs preceding that point that were valid UTF-16, if not legible text.
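A small illustration with made-up data (not your file): dropping a single byte from a valid UTF-16-LE stream makes everything after the gap decode to the wrong characters, and strict decoding eventually fails:

data = "10059\t10059_974717531091".encode("utf-16-le")
corrupted = data[:10] + data[11:]   # drop one byte in the middle of the stream

try:
    corrupted.decode("utf-16-le")   # strict decoding
except UnicodeDecodeError as exc:
    print(exc)                      # the now odd-length data ends in a truncated byte pair

print(repr(corrupted.decode("utf-16-le", errors="ignore")))
# '10059' followed by garbage: the misaligned pairs decode to unrelated CJK-range characters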

answered Dec 08 '22 by Martijn Pieters