
Alternatives to `tell()` while iterating over lines of a file in Python3?

How can I find out the location of the file cursor when iterating over a file in Python3?

In Python 2.7 it's trivial: use tell(). In Python 3, that same call raises an OSError:

Traceback (most recent call last):
  File "foo.py", line 113, in check_file
    pos = infile.tell()
OSError: telling position disabled by next() call

My use case is making a progress bar for reading large CSV files. Computing a total line count is too expensive and requires an extra pass. An approximate value is plenty: I don't care about buffers or other sources of noise, I just want to know whether it'll take 10 seconds or 10 minutes.

Simple code to reproduce the issue. It works as expected on Python 2.7, but raises on Python 3:

import csv
import os

file_size = os.stat(path).st_size  # path: the CSV file to read
with open(path, "r") as infile:
    reader = csv.reader(infile)
    for row in reader:
        pos = infile.tell()  # OSError: telling position disabled by next() call
        print("At byte {} of {}".format(pos, file_size))

This answer https://stackoverflow.com/a/29641787/321772 suggests that the problem is that the next() method disables tell() during iteration. The alternative is to read line by line manually, but that code is inside the csv module, so I can't get at it. I also can't fathom what Python 3 gains by disabling tell().

So what is the preferred way to find out your byte offset while iterating over the lines of a file in Python 3?

Adam asked Sep 25 '17 12:09



1 Answer

The csv module only expects the first argument of the csv.reader call to be an iterator that returns one line on each next() call. So you can use an iterator wrapper that counts the bytes as they pass through. For the count to be accurate, you have to open the file in binary mode. That is actually fine here, because binary mode performs no end-of-line conversion, which is what the csv module expects. The wrapper then decodes each line to a string before handing it to the reader.

So a possible wrapper is:

class SizedReader:
    def __init__(self, fd, encoding='utf-8'):
        self.fd = fd
        self.size = 0
        self.encoding = encoding   # specify encoding in constructor, with utf8 as default
    def __next__(self):
        line = next(self.fd)
        self.size += len(line)
        return line.decode(self.encoding)   # returns a decoded line (a true Python 3 string)
    def __iter__(self):
        return self

Your code would then become:

import csv
import os

file_size = os.stat(path).st_size
with open(path, "rb") as infile:
    szrdr = SizedReader(infile)
    reader = csv.reader(szrdr)
    for row in reader:
        pos = szrdr.size  # gives position at end of current line
        print("At byte {} of {}".format(pos, file_size))

The good news here is that you keep all the power of the csv module, including newlines in quoted fields...
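To check the point about quoted newlines, here is a minimal self-contained sketch of the same byte-counting idea, written as a generator instead of a class. The helper name rows_with_offsets and the in-memory sample data are made up for illustration; they are not part of the csv module or of the code above. It assumes an encoding such as UTF-8, where splitting the raw bytes on b"\n" yields valid lines.

```python
import csv
import io


def rows_with_offsets(binary_file, encoding="utf-8"):
    """Yield (row, bytes_consumed) pairs from a file opened in binary mode.

    Hypothetical helper for illustration; not part of the csv module.
    """
    consumed = 0

    def counted_lines():
        nonlocal consumed
        for raw_line in binary_file:
            consumed += len(raw_line)        # count raw bytes, before decoding
            yield raw_line.decode(encoding)  # csv.reader needs str, not bytes

    for row in csv.reader(counted_lines()):
        yield row, consumed


# In-memory stand-in for a real file. The second record contains a newline
# inside a quoted field, and the third contains a two-byte UTF-8 character.
data = 'id,comment\r\n1,"two\nlines"\r\n2,caf\u00e9\r\n'.encode("utf-8")
total = len(data)

results = list(rows_with_offsets(io.BytesIO(data)))
for row, pos in results:
    print("At byte {} of {} ({:.0%})".format(pos, total, pos / total))
```

Because the csv reader pulls a second physical line to finish the quoted field, the offset reported for that record covers both lines, so the count stays in step with what has actually been read from the file.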

Serge Ballesta answered Oct 11 '22 07:10
