
Python3 reading mixed text/binary data line-by-line

I need to parse a file which has a UTF-16 text header followed directly by binary data. To be able to read the binary data, I open the file in "rb" mode, then, for reading the header, wrap it in an io.TextIOWrapper().

The problem is that when I execute the .readline() method of the TextIOWrapper object, the wrapper reads ahead too far (even though I only requested a single line) and then runs into a UTF-16 decoding error when encountering the binary portion: A UnicodeDecodeError is raised.

However, I need proper parsing of the text data and cannot simply do a binary read first, then do a data.find(b"\n\0") because it's not guaranteed that this actually matches at an even offset (could be midway in-between characters). I would like to avoid doing UTF-16 parsing myself.

Is there an easy way to tell TextIOWrapper to not read ahead?

itecMemory asked Sep 06 '18 09:09



1 Answer

No, you can't use a TextIOWrapper() object here. It reads from the underlying buffer in larger blocks, not line by line, so it will try to decode binary data past the first line. You can't prevent this.
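
To make the read-ahead visible, here is a minimal sketch using an in-memory stream. The header text and the payload bytes are made up for illustration; the payload is two lone low surrogates (U+DFFF in little-endian byte order), which cannot be decoded as UTF-16. On CPython the wrapper is expected to pull the whole buffer in as one chunk, so the error surfaces even though only one line was requested:

```python
import io

# Hypothetical file contents: a UTF-16-LE header line followed by bytes that
# are not valid UTF-16 (two lone low surrogates in little-endian order).
header = "header line\n".encode("utf-16-le")
payload = b"\xff\xdf\xff\xdf"

wrapper = io.TextIOWrapper(io.BytesIO(header + payload), encoding="utf-16-le")

line = None
error = None
try:
    # Only one line is requested, but the wrapper decodes a whole chunk,
    # and that chunk includes the undecodable payload.
    line = wrapper.readline()
except UnicodeDecodeError as exc:
    error = exc

print("line:", line)
print("error:", type(error).__name__)
```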

For a single line of text using \n line delimiters, you really don't need to use TextIOWrapper(). Binary files still support line-by-line reading, where file.readline() will give you the binary data up to the next \n byte. Just open the file as binary, and read one line.

Valid UTF-16 data always has an even length. But because UTF-16 comes in two flavours, big-endian and little-endian, you need to check how many bytes were read to work out which byte order was used, and conditionally read one more byte that belongs to that first line. If little-endian UTF-16 was used, you are guaranteed to have read an odd number of bytes, because a newline is encoded as 0a 00 rather than 00 0a, and the .readline() call will have left the single 00 byte in the file stream. In that case, just read one more byte and add it to the first-line data before decoding:

with open(filename, 'rb') as binfile:
    firstline = binfile.readline()
    if len(firstline) % 2:
        # little-endian UTF-16, add one more byte
        firstline += binfile.read(1)
    text = firstline.decode('utf-16')

    # read binary data from the file
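
The parity argument can be checked on in-memory data. The sketch below is only an illustration: first_line() and PAYLOAD are hypothetical names, not part of any library. It also shows a limitation worth knowing about: the trick assumes the first 0x0a byte in the file belongs to the newline. A character such as U+010A (encoded as 0a 01 in little-endian) or anything in the range U+0A00 to U+0AFF in big-endian contains a 0x0a byte of its own, and readline() will stop there instead.

```python
import io

PAYLOAD = b"\x01\x02\x03"  # hypothetical stand-in for the binary section


def first_line(data):
    """Apply the readline-plus-parity trick to in-memory bytes."""
    binfile = io.BytesIO(data)
    raw = binfile.readline()        # stops right after the first 0x0a byte
    if len(raw) % 2:
        raw += binfile.read(1)      # odd length: pick up the trailing 00
    return raw


# Little-endian: readline() returns 13 bytes, one more is read to make 14.
le_line = first_line("header\n".encode("utf-16-le") + PAYLOAD)
# Big-endian: readline() already returns all 14 bytes.
be_line = first_line("header\n".encode("utf-16-be") + PAYLOAD)

# Limitation: U+010A encodes to 0a 01 in little-endian, so the line is cut
# short after that character instead of at the real newline.
cut_short = first_line("\u010a header\n".encode("utf-16-le") + PAYLOAD)

print(ascii(le_line.decode("utf-16-le")))
print(ascii(be_line.decode("utf-16-be")))
print(ascii(cut_short.decode("utf-16-le")))
```

If the header is known to contain only characters without a 0x0a byte in either half (plain ASCII text, for example), the limitation does not apply.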

A demo with io.BytesIO(): we first write UTF-16 little-endian data (with a BOM to tell the decoder the byte order), followed by two lone low surrogates that stand in for 'binary data' because they would cause a UTF-16 decoding error. We then read the text and the data back:

>>> import io
>>> from pprint import pprint
>>> binfile = io.BytesIO()
>>> utf16le_wrapper = io.TextIOWrapper(binfile, encoding='utf-16-le', write_through=True)
>>> utf16le_wrapper.write('\ufeff')  # write the UTF-16 BOM manually, as the -le and -be variants won't include this
1
>>> utf16le_wrapper.write('The quick brown 🦊 jumps over the lazy 🐕\n')
40
>>> binfile.write(b'\xFF\xDF\xFF\xDF')  # binary data: two lone low surrogates (U+DFFF) in little-endian order, which won't decode as UTF-16
4
>>> binfile.flush()  # flush and seek back to start to move to reading
>>> binfile.seek(0)
0
>>> firstline = binfile.readline()  # read that first line
>>> len(firstline) % 2              # confirm we read an odd number of bytes
1
>>> firstline += binfile.read(1)    # add the expected null byte
>>> pprint(firstline)               # peek at the UTF-16 data we read
(b'\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00'
 b'w\x00n\x00 \x00>\xd8\x8a\xdd \x00j\x00u\x00m\x00p\x00s\x00 \x00o\x00v\x00'
 b'e\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00=\xd8\x15\xdc'
 b'\n\x00')
>>> print(firstline.decode('utf-16'))  # bom included, so the decoder detects LE vs BE
The quick brown 🦊 jumps over the lazy 🐕

>>> binfile.read()
b'\xff\xdf\xff\xdf'

Any alternative implementation that still uses TextIOWrapper() would need an intermediate wrapper sitting between the binary file and the TextIOWrapper() instance to prevent TextIOWrapper() from reading too far. This would get complex fast, and the wrapper would need to know which codec is used anyway. For a single line of text, that's just not worth the effort required.
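
A middle ground that skips TextIOWrapper() entirely but still leaves the decoding to the codec is an incremental decoder fed one code unit (two bytes) at a time. This is only a sketch under some assumptions: read_text_line() is a hypothetical helper, the stream is a buffered binary file whose read(2) returns two bytes until end of file, and the line ends in "\n". As far as I know, the incremental "utf-16" decoder needs a BOM at the start of the data; without one, pass "utf-16-le" or "utf-16-be" explicitly.

```python
import codecs
import io


def read_text_line(binfile, encoding="utf-16"):
    """Decode one line from a binary stream without reading past the newline."""
    decoder = codecs.getincrementaldecoder(encoding)()
    chars = []
    while True:
        unit = binfile.read(2)              # one UTF-16 code unit
        if not unit:
            break                           # end of file before a newline
        # Returns '' for a BOM or the first half of a surrogate pair.
        chars.append(decoder.decode(unit))
        if chars[-1] == "\n":
            break
    return "".join(chars)


# Made-up sample data: BOM, a header containing U+010A and an emoji,
# then bytes that are not valid UTF-16.
stream = io.BytesIO(
    codecs.BOM_UTF16_LE
    + "\u010a fox \U0001f98a\n".encode("utf-16-le")
    + b"\xff\xdf\xff\xdf"
)
text = read_text_line(stream)
rest = stream.read()                        # the binary section, untouched

print(ascii(text))
print(rest)
```

Because the comparison is done on whole code units after decoding, a 0x0a byte inside another character is not mistaken for the newline, and the stream position ends up exactly at the start of the binary data. Reading two bytes per call is slow for long text, but for a single header line that rarely matters.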

Martijn Pieters answered Oct 18 '22 04:10
