How can I find out the location of the file cursor when iterating over a file in Python3?
In Python 2.7 it's trivial: use tell(). In Python 3 that same call throws an OSError:
Traceback (most recent call last):
  File "foo.py", line 113, in check_file
    pos = infile.tell()
OSError: telling position disabled by next() call
My use case is making a progress bar for reading large CSV files. Computing a total line count is too expensive and requires an extra pass. An approximate value is plenty useful: I don't care about buffers or other sources of noise, I just want to know whether it'll take 10 seconds or 10 minutes.
Simple code to reproduce the issue. It works as expected on Python 2.7, but throws on Python 3:
import csv
import os

file_size = os.stat(path).st_size
with open(path, "r") as infile:
    reader = csv.reader(infile)
    for row in reader:
        pos = infile.tell()  # OSError: telling position disabled by next() call
        print("At byte {} of {}".format(pos, file_size))
This answer https://stackoverflow.com/a/29641787/321772 suggests that the problem is that the next() method disables tell() during iteration. Alternatives are to manually read line by line instead, but that code is inside the CSV module so I can't get at it. I also can't fathom what Python 3 gains by disabling tell().
So what is the preferred way to find out your byte offset while iterating over the lines of a file in Python 3?
The csv module only expects the first parameter of the reader call to be an iterator that returns one line on each next call. So you can use an iterator wrapper that counts the bytes as they are read. If you want the count to be accurate, you will have to open the file in binary mode and let the wrapper decode each line, since csv.reader expects strings in Python 3. This is in fact fine, because binary mode performs no end-of-line conversion, which is what the csv module expects.
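To illustrate that point, here is a minimal sketch showing csv.reader consuming a plain generator instead of a file object. The generator name and the sample lines are made up for the example.

```python
import csv


def sample_lines():
    # Hypothetical data source: any iterator that yields one text line
    # per next() call is acceptable to csv.reader.
    yield "a,b,c\n"
    yield "1,2,3\n"


rows = list(csv.reader(sample_lines()))
print(rows)
```

Because csv.reader never needs an actual file, a wrapper can sit between the file and the reader and do its own bookkeeping.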
So a possible wrapper is:
class SizedReader:
    def __init__(self, fd, encoding='utf-8'):
        self.fd = fd
        self.size = 0
        self.encoding = encoding  # specify encoding in constructor, with utf8 as default

    def __next__(self):
        line = next(self.fd)
        self.size += len(line)
        return line.decode(self.encoding)  # returns a decoded line (a true Python 3 string)

    def __iter__(self):
        return self
Your code would then become:
import csv
import os

file_size = os.stat(path).st_size
with open(path, "rb") as infile:
    szrdr = SizedReader(infile)
    reader = csv.reader(szrdr)
    for row in reader:
        pos = szrdr.size  # gives position at end of current line
        print("At byte {} of {}".format(pos, file_size))
The good news here is that you keep all the power of the csv module, including newlines in quoted fields...
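As a rough, self-contained check of that claim, the sketch below writes a small temporary CSV file whose second record has a newline inside a quoted field, then reads it through the wrapper. The sample data and the temporary file are assumptions made up for the example; the class is the same as the one above, repeated only so the snippet runs on its own.

```python
import csv
import os
import tempfile


class SizedReader:
    """Iterate over a binary file, counting the bytes consumed so far."""

    def __init__(self, fd, encoding="utf-8"):
        self.fd = fd
        self.size = 0
        self.encoding = encoding

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.fd)  # raw bytes, line terminator included
        self.size += len(line)
        return line.decode(self.encoding)


# Hypothetical sample: the second record spans two physical lines.
data = b'id,comment\r\n1,"two\r\nlines"\r\n2,plain\r\n'

with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(data)
    path = tmp.name

try:
    file_size = os.stat(path).st_size
    rows = []
    with open(path, "rb") as infile:
        szrdr = SizedReader(infile)
        for row in csv.reader(szrdr):
            rows.append(row)
            # Fraction of the file consumed once this record is complete.
            print("{:.0%} done".format(szrdr.size / file_size))
    final_size = szrdr.size
finally:
    os.remove(path)
```

The quoted field comes back as a single value with its embedded line break intact, and once the loop finishes the byte count equals the file size, so the ratio is usable as a progress estimate.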