I have an application that generates some large log files (> 500 MB).
I have written some utilities in Python that allow me to quickly browse the log file and find data of interest. But I now get some datasets where the file is too big to load entirely into memory.
I thus want to scan the document once, build an index, and then only load into memory the section of the document that I want to look at.
This works for me when I open a plain file, read it one line at a time, and store the offset from file.tell(). I can then come back to that section of the file later with file.seek(offset, 0).
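In simplified form, that looks something like the sketch below (the file name and line number are just placeholders):

    # Scan once, remembering the byte offset at the start of every line.
    offsets = []
    with open('app.log') as f:            # 'app.log' is a placeholder
        while True:
            offset = f.tell()
            if not f.readline():
                break
            offsets.append(offset)

    # Later, jump straight back to e.g. the 501st line without rescanning.
    with open('app.log') as f:
        f.seek(offsets[500], 0)
        line = f.readline()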
My problem is, however, that I may have UTF-8 in the log files, so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). With the resulting object I can call seek and tell, but they do not match up.
I assume that codecs needs to do some buffering, or maybe it returns character counts instead of byte offsets from tell()?
Is there a way around this?
If so, this sounds like a bug or limitation of the codecs module, as it is probably confusing byte and character offsets.
I would use the regular open() function for opening the file (in binary mode, so that you get bytes back); then seek()/tell() will give you byte offsets that are always consistent. Whenever you want to read, use f.readline().decode('utf-8').

Beware, though, that using the f.read() function can land you in the middle of a multi-byte character, thus producing a UTF-8 decode error. readline() will always work.
This doesn't transparently handle the byte-order mark for you, but chances are your log files do not have BOMs anyway.
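A minimal sketch of that recipe, assuming the file is opened in binary mode ('rb') so that tell(), seek() and readline() all work in raw bytes (the file name and index position are placeholders):

    # Build a byte-offset index over the raw (undecoded) log.
    offsets = []
    with open('app.log', 'rb') as f:   # binary mode: offsets are plain byte counts
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)

    # Later: seek to a recorded offset and decode only the line you need.
    with open('app.log', 'rb') as f:
        f.seek(offsets[1000], 0)
        line = f.readline().decode('utf-8')
    # readline() is safe to decode here: a UTF-8 multi-byte sequence never
    # contains the newline byte, so the line always ends on a character boundary.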
For UTF-8, you don't actually need to open the file with codecs.open. Instead, it is reliable to read the file as a byte string first, and only then decode an individual section (invoking the .decode method on that slice). Breaking the file at line boundaries is safe; the only unsafe way to split it would be in the middle of a multi-byte character, which you can recognize because every byte belonging to one has a value of 128 or above.
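For example, a rough sketch of decoding just one section: read a block of raw bytes starting at an indexed offset, trim it back to the last complete line, and decode only that slice (the file name, offset and block size here are illustrative):

    start = 0                      # a byte offset of a line start, taken from your index
    with open('app.log', 'rb') as f:
        f.seek(start, 0)
        chunk = f.read(64 * 1024)  # one block of raw bytes

    # Trim to the last newline so the slice ends on a line (and therefore
    # character) boundary; bytes with values of 128 or above always belong
    # to a multi-byte character and must never be split.
    cut = chunk.rfind(b'\n')
    section = chunk[:cut + 1].decode('utf-8') if cut != -1 else ''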