I have a huge text file which I want to open.
I'm reading the file in chunks, avoiding memory issues related to reading too much of the file all at once.
code snippet:
import re

def open_delimited(fileName, args):
    with open(fileName, args, encoding="UTF16") as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk)
            for piece in pieces[:-1]:
                yield piece
            remainder = '{} {} '.format(*pieces[-1])
        if remainder:
            yield remainder
The code throws the error UnicodeDecodeError: 'utf16' codec can't decode bytes in position 8190-8191: unexpected end of data.
I tried UTF8 and got the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte.
Both latin-1 and iso-8859-1 raised the error IndexError: list index out of range.
A sample of the input file:
b'\xff\xfe1\x000\x000\x005\x009\x00\t\x001\x000\x000\x005\x009\x00_\x009\x007\x004\x007\x001\x007\x005\x003\x001\x000\x009\x001\x00\t\x00\t\x00P\x00o\x00s\x00t\x00\t\x001\x00\t\x00H\x00a\x00p\x00p\x00y\x00 \x00B\x00i\x00r\x00t\x00h\x00d\x00a\x00y\x00\t\x002\x000\x001\x001\x00-\x000\x008\x00-\x002\x004\x00 \x00'
I will also mention that I have several of these huge text files. UTF16 works fine for many of them, but fails on one specific file.
Is there any way to resolve this issue?
To ignore corrupted data (which can lead to data loss), set errors='ignore' on the open() call:
with open(fileName, args, encoding="UTF16", errors='ignore') as infile:
The open()
function documentation states:
'ignore'
ignores errors. Note that ignoring encoding errors can lead to data loss.
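As a small, self-contained illustration (with made-up bytes, not your actual file), here is how errors='ignore' changes decoding behavior when a UTF-16 stream is cut off mid-character:

```python
# Valid UTF-16 text with a BOM, then the same bytes truncated mid-character.
data = "10059\t2011-08-24".encode("utf-16")  # includes the \xff\xfe BOM
truncated = data[:-1]                        # drop one byte: the last character is incomplete

# Strict decoding raises, much like the traceback in the question.
try:
    truncated.decode("utf-16")
except UnicodeDecodeError as exc:
    print(exc)

# errors='ignore' silently drops the undecodable tail instead.
print(truncated.decode("utf-16", errors="ignore"))  # '10059\t2011-08-2'
```

Note that the final '4' is simply gone from the output; that silent disappearance is exactly the data loss the documentation warns about.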
This does not mean you can recover from the apparent data corruption you are experiencing.
To illustrate, imagine a byte was dropped or added somewhere in your file. UTF-16 is a codec that uses 2 bytes per character. If there is one byte missing or surplus then all byte-pairs following the missing or extra byte are going to be out of alignment.
That can lead to problems decoding further down the line, not necessarily immediately. There are some codepoints in UTF-16 that are illegal on their own, usually because they must be used in combination with another byte-pair; your exception was thrown for such an invalid codepoint. But there may have been hundreds or thousands of byte-pairs preceding that point that were valid UTF-16, if not legible text.
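To make the misalignment concrete, this sketch (using made-up text, not your data) drops a single byte in the middle of a UTF-16 stream. Everything after the gap still decodes without an error, but it comes out as garbage, because every subsequent byte-pair is shifted by one:

```python
text = "Happy Birthday"
data = text.encode("utf-16-le")  # no BOM, 2 bytes per character

# Drop one byte in the middle; all byte-pairs after it fall out of alignment.
corrupted = data[:6] + data[7:]

decoded = corrupted.decode("utf-16-le", errors="ignore")
print(decoded)  # starts with 'Hap', then unrelated characters instead of 'py Birthday'
```

This is why errors='ignore' cannot recover the original text here: the misaligned bytes are still "valid" to the decoder, so nothing is ignored, and the damage only surfaces if an illegal codepoint happens to be produced.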