I have an application that generates some large log files (> 500 MB).
I have written some utilities in Python that allow me to quickly browse the log file and find data of interest. But I now get some datasets where the file is too big to load entirely into memory.
I thus want to scan the document once, build an index, and then only load into memory the section of the document that I want to look at.
This works for me when I open a plain file, read it one line at a time, and store the offset from file.tell(). I can then come back to that section of the file later with file.seek(offset, 0).
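In simplified form, that looks something like the sketch below (the file name and line number are just placeholders):

    # Scan once, remembering the byte offset at the start of every line.
    offsets = []
    with open('app.log') as f:            # 'app.log' is a placeholder
        while True:
            offset = f.tell()
            if not f.readline():
                break
            offsets.append(offset)

    # Later, jump straight back to e.g. the 501st line without rescanning.
    with open('app.log') as f:
        f.seek(offsets[500], 0)
        line = f.readline()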
My problem is, however, that I may have UTF-8 in the log files, so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). With the resulting object I can call seek and tell, but they do not match up.
I assume that codecs needs to do some buffering, or maybe it returns character counts instead of byte offsets from tell()?
Is there a way around this?
If so, this sounds like a bug or limitation of the codecs module, as it is probably confusing byte and character offsets.
I would use the regular open() function for opening the file (in binary mode, so that you get bytes back); then seek()/tell() will give you byte offsets that are always consistent. Whenever you want to read, use f.readline().decode('utf-8').

Beware, though, that using the f.read() function can land you in the middle of a multi-byte character, thus producing a UTF-8 decode error. readline() will always work.
This doesn't transparently handle the byte-order mark for you, but chances are your log files do not have BOMs anyway.
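A minimal sketch of that recipe, assuming the file is opened in binary mode ('rb') so that tell(), seek() and readline() all work in raw bytes (the file name and index position are placeholders):

    # Build a byte-offset index over the raw (undecoded) log.
    offsets = []
    with open('app.log', 'rb') as f:   # binary mode: offsets are plain byte counts
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)

    # Later: seek to a recorded offset and decode only the line you need.
    with open('app.log', 'rb') as f:
        f.seek(offsets[1000], 0)
        line = f.readline().decode('utf-8')
    # readline() is safe to decode here: a UTF-8 multi-byte sequence never
    # contains the newline byte, so the line always ends on a character boundary.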
For UTF-8, you don't actually need to open the file with codecs.open. Instead, it is reliable to read the file as a byte string first, and only then decode an individual section (invoking the .decode method on that slice). Breaking the file at line boundaries is safe; the only unsafe way to split it would be in the middle of a multi-byte character, which you can recognize because every byte belonging to one has a value of 128 or above.
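For example, a rough sketch of decoding just one section: read a block of raw bytes starting at an indexed offset, trim it back to the last complete line, and decode only that slice (the file name, offset and block size here are illustrative):

    start = 0                      # a byte offset of a line start, taken from your index
    with open('app.log', 'rb') as f:
        f.seek(start, 0)
        chunk = f.read(64 * 1024)  # one block of raw bytes

    # Trim to the last newline so the slice ends on a line (and therefore
    # character) boundary; bytes with values of 128 or above always belong
    # to a multi-byte character and must never be split.
    cut = chunk.rfind(b'\n')
    section = chunk[:cut + 1].decode('utf-8') if cut != -1 else ''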