
Can seek and tell work with UTF-8 encoded documents in Python?

I have an application that generates some large log files (over 500 MB).

I have written some utilities in Python that allow me to quickly browse a log file and find data of interest. But I now get some datasets where the file is too big to load into memory all at once.

I thus want to scan the document once, build an index, and then load into memory only the section of the document that I want to look at.

This works for me when I open a 'file', read it one line at a time, and store the offset from file.tell(). I can then come back to that section of the file later with file.seek(offset, 0).

My problem, however, is that I may have UTF-8 in the log files, so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). With the resulting object I can call seek and tell, but they do not match up.

I assume that codecs needs to do some buffering or maybe it returns character counts instead of bytes from tell?

Is there a way around this?

Jeroen Dirks asked Oct 02 '09 15:10


2 Answers

If true, this sounds like a bug or limitation of the codecs module, as it's probably confusing byte and character offsets.

I would use the regular open() function to open the file (in binary mode, 'rb', on Python 3); seek() and tell() will then give you byte offsets that are always consistent. Whenever you want to read, use f.readline().decode('utf-8').
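
As a rough sketch of that approach, assuming Python 3 and a file opened in binary mode: the helper names build_line_index and read_line_at, and the small demo file, are made up for illustration and are not part of any library.

```python
import os
import tempfile

def build_line_index(path):
    """Scan the file once and return the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:  # binary mode: tell() reports plain byte offsets
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:  # empty bytes object means end of file
                break
            offsets.append(pos)
    return offsets

def read_line_at(path, offset):
    """Seek to a stored byte offset and decode only that one line."""
    with open(path, "rb") as f:
        f.seek(offset, 0)
        return f.readline().decode("utf-8")

# Hypothetical demo log containing multi-byte characters
demo_path = os.path.join(tempfile.mkdtemp(), "demo.log")
with open(demo_path, "wb") as f:
    f.write("first line\nsecond: na\u00efve caf\u00e9\nthird: \u65e5\u672c\u8a9e\n".encode("utf-8"))

index = build_line_index(demo_path)
second = read_line_at(demo_path, index[1])
third = read_line_at(demo_path, index[2])
```

For a real log you would keep the file open rather than reopening it on every lookup; the index only needs one integer per line, so it stays small even for very large files.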

Beware, though, that calling f.read() with a byte count can stop in the middle of a multi-byte character, producing a UTF-8 decode error. readline() will always work.
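
A small illustration of that pitfall, using an in-memory io.BytesIO object as a stand-in for a file opened in binary mode (the sample text is made up):

```python
import io

# "é" (U+00E9) is encoded as the two bytes C3 A9 in UTF-8
f = io.BytesIO("caf\u00e9 au lait\nnext\n".encode("utf-8"))

chunk = f.read(4)  # stops after the first byte of "é"
try:
    chunk.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # the truncated character cannot be decoded

# Reading a whole line always ends on a character boundary
f.seek(0)
line = f.readline().decode("utf-8")
```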

This doesn't transparently handle the byte-order mark for you, but chances are your log files do not have BOMs anyway.

intgr answered Sep 21 '22 10:09


For UTF-8, you don't actually need to open the file with codecs.open. Instead, it is reliable to read the file as a byte string first, and only then decode an individual section (invoking the .decode method on the string). Breaking the file at line boundaries is safe, because the newline byte never occurs inside a multi-byte character; the only unsafe way to split it would be in the middle of a multi-byte character (which you can recognize because every one of its bytes has a value >= 128).
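
If you ever do need to start at an arbitrary byte offset rather than a line boundary, one possible sketch is to skip forward past continuation bytes, which in UTF-8 always have the bit pattern 10xxxxxx. The function name align_to_char_start is hypothetical, and this assumes Python 3, where indexing a bytes object yields an integer.

```python
def align_to_char_start(data, pos):
    """Advance pos past any UTF-8 continuation bytes (0b10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

# Each of these three CJK characters takes 3 bytes in UTF-8
raw = "\u65e5\u672c\u8a9e log".encode("utf-8")

start = align_to_char_start(raw, 1)  # byte 1 lies inside the first character
text = raw[start:].decode("utf-8")   # decoding now begins on a character boundary
```

The same care applies to the end of a section: if it does not end at a line boundary, the last character may be cut off as well.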

Martin v. Löwis answered Sep 24 '22 10:09