All I want to do is create a list of the starting positions of each line so I can seek to them really fast. I am getting the error, "telling position disabled by 'next()' call." How do I overcome this?
>>> in_file = open("data_10000.txt")
>>> in_file.tell()
0
>>> line_numbers = [in_file.tell() for line in in_file]
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    line_numbers = [in_file.tell() for line in in_file]
  File "<pyshell#9>", line 1, in <listcomp>
    line_numbers = [in_file.tell() for line in in_file]
OSError: telling position disabled by next() call
Note: in this context, the index would relate the line number to the seek position.
A simple generator can solve your problem:
def line_ind(fileobj):
    i = 0  # offset of the line about to be yielded
    for line in fileobj:
        yield i
        i += len(line)  # characters in text mode, bytes in binary mode
It yields (generates) the starting positions of the lines one by one. A regular function returns a value and stops. A generator pauses after each yield and resumes from the same point when the next value is requested, until it is exhausted. So what I do here is yield 0, then add the length of the first line to it, yield the result, then add the length of the second line, and so on. This produces the indices you want.
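To make the pause-and-resume behaviour concrete, here is a minimal sketch that steps through the generator by hand. The in-memory `io.BytesIO` object and its contents are made-up stand-ins for a real file opened in binary mode.

```python
import io

def line_ind(fileobj):
    """Yield the starting offset of each line in fileobj."""
    i = 0
    for line in fileobj:
        yield i
        i += len(line)

# Hypothetical sample data: four lines, the third one empty.
sample = io.BytesIO(b"ab\ncde\n\nf")

gen = line_ind(sample)
first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes, adds len(b"ab\n"), yields again
rest = list(gen)     # drains whatever offsets are left

print(first, second, rest)
```

Each `next()` call only advances the generator as far as the following `yield`, so no offset is computed before it is asked for.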
To put the yielded values into a list you can use list(generator()), the same way you can use list(range(10)). When you open a file, it is better to do it using with, like below. The point is not only that you will often forget to close the file object (you will), but that with closes it automatically even if an exception occurs. So with the code below I have two lists of starting position indices:
with open("test.dat", encoding="utf-8") as f:
    u_ind = list(line_ind(f))
    f.seek(0)
    u = f.read()
with open("test.dat", "rb") as f:
    b_ind = list(line_ind(f))
    f.seek(0)
    b = f.read()
Note that the indices for unicode strings can differ from those for bytestrings. An accented character, for example, can take two bytes of space. The first list contains character indices; use it when you work with the regular string representation of your file. The second list contains byte offsets. The example below shows how the index values differ in the two cases on a test file:
>>> u_ind[-10:]
[24283, 24291, 24300, 24309, 24322, 24331, 24341, 24349, 24359, 24368]
>>> b_ind[-10:]
[27297, 27306, 27316, 27326, 27342, 27352, 27363, 27372, 27383, 27393]
Now I want the content of the last line:
>>> u[24368:]
'S-érték=9,59'
>>> b[27393:]
b'S-\xc3\xa9rt\xc3\xa9k=9,59'
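The gap between the two index lists comes from characters that need more than one byte in UTF-8. As a small sketch, the last line shown above can be measured both ways; the variable names `text` and `data` are only illustrative.

```python
# The last line of the test file, written with escapes so the accented
# characters are unambiguous (U+00E9 is a precomposed "e with acute").
text = "S-\u00e9rt\u00e9k=9,59"
data = text.encode("utf-8")

char_count = len(text)   # counts characters
byte_count = len(data)   # counts bytes; each U+00E9 takes two in UTF-8

print(char_count, byte_count)
```

Every such character before a given line pushes that line's byte offset one further ahead of its character offset.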
If, however, you want to use seek() before read(), you have to stick to the byte indices:
>>> with open("test.dat", encoding="utf-8") as f:
...     f.seek(27393)
...     f.read()
...
27393
'S-érték=9,59'
>>> with open("test.dat", "rb") as f:
...     f.seek(27393)
...     f.read()
...
27393
b'S-\xc3\xa9rt\xc3\xa9k=9,59'
Using 24368 in the first case would be a terrible mistake here.
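Since seeking needs byte offsets anyway, another option is to collect them with tell() and readline() on a file opened in binary mode. This is a different technique from the generator above, not a replacement for it. It avoids the error from the question because that error is raised when a text-mode file is being iterated, not by readline(). The helper names `byte_offsets` and `read_line_at`, the temporary file and its contents are all assumptions made for this sketch.

```python
import os
import tempfile

def byte_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()        # position before reading the next line
            if not f.readline():  # empty bytes object means end of file
                break
            offsets.append(pos)
    return offsets

def read_line_at(path, offsets, n):
    """Seek straight to line n (0-based) and return it decoded as UTF-8."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode("utf-8")

# Hypothetical sample file with one line containing accented characters.
with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as tmp:
    tmp.write("alpha\nS-\u00e9rt\u00e9k=9,59\nomega\n".encode("utf-8"))
    sample_path = tmp.name

offsets = byte_offsets(sample_path)
middle_line = read_line_at(sample_path, offsets, 1)
os.remove(sample_path)

print(offsets, repr(middle_line))
```

The index maps line number to seek position, which is what the question asked for, and decoding happens only for the line that is actually read.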
Note that when you read() the content of the file into a string/bytestring object and want to deal with individual lines thereafter, it is wiser to use .splitlines().
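As a quick illustration of that last point, here is a sketch on a made-up string with mixed line endings:

```python
# Hypothetical content, as it might look after reading a file in one go.
text = "first\nsecond\r\nthird"

lines = text.splitlines()              # line endings stripped
kept = text.splitlines(keepends=True)  # line endings preserved

print(lines)
print(kept)
```

With keepends=True the pieces still add up to the original string, which helps if you need line lengths as well as line contents.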
Hope this helped!