
How to Populate List of Starting Positions of Each Line Using For-loop and Tell Function?

Tags:

python-3.x

I want to build a list of the starting position of each line so that I can seek to any line quickly. Calling tell() while looping over the file raises the error "telling position disabled by next() call". How do I overcome this?

>>> in_file = open("data_10000.txt")
>>> in_file.tell()
0
>>> line_numbers = [in_file.tell() for line in in_file]
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    line_numbers = [in_file.tell() for line in in_file]
  File "<pyshell#9>", line 1, in <listcomp>
    line_numbers = [in_file.tell() for line in in_file]
OSError: telling position disabled by next() call

Note: in this context, the index would relate the line number to the seek position.

user2316667 asked Nov 05 '13 05:11


1 Answer

A simple generator can solve your problem:

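def line_ind(fileobj):
    """Yield the starting index of each line in fileobj."""
    i = 0
    for line in fileobj:
        yield i          # index where the current line starts
        i += len(line)   # characters in text mode, bytes in binary mode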

It yields (generates) the starting index of each line, one at a time. A regular function returns a value and stops. A generator pauses at each yield and picks up from there when the next value is requested, until it is exhausted. So what I have done here is yield 0, then add the length of the first line to it and yield that, then add the length of the second line, and so on. This produces the indices you want.
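The error in the question only appears when the file is iterated with a for-loop, so another route is to call tell() between readline() calls instead. The following is a minimal sketch of that idea, not the method used in this answer; the name line_offsets and the temporary demo file are assumptions made for illustration, and binary mode is chosen so that the positions are plain byte offsets.

```python
import os
import tempfile

def line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()          # readline() does not disable tell()
            if not f.readline():    # empty bytes object means end of file
                break
            offsets.append(pos)
    return offsets

# Small self-contained demo on a temporary file (hypothetical data)
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"ab\ncde\nf\n")
demo_offsets = line_offsets(tmp.name)
os.remove(tmp.name)
print(demo_offsets)  # [0, 3, 7]
```

Both approaches give the same byte offsets in binary mode; the generator above avoids the tell() calls altogether.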

To put the yielded values into a list you can use list(generator()), the same way you can use list(range(10)). When you open a file, it is better to do it using with, as below. Not just because you will often forget to close the file object (you will), but because with closes it automatically even if an exception occurs. So with the code below I have two lists of starting position indices:

with open("test.dat", encoding="utf-8") as f:
    u_ind = list(line_ind(f))
    f.seek(0)
    u = f.read()

with open("test.dat", "rb") as f:
    b_ind = list(line_ind(f))
    f.seek(0)
    b = f.read()

Note that the indices for unicode strings can differ from those for bytestrings. An accented character, for example, can take two bytes in UTF-8. The first list contains character indices; use it when you work with the regular string representation of your file. The second list contains byte indices. The example below shows how the index values differ in the two cases on a test file:

>>> u_ind[-10:]
[24283, 24291, 24300, 24309, 24322, 24331, 24341, 24349, 24359, 24368]
>>> b_ind[-10:]
[27297, 27306, 27316, 27326, 27342, 27352, 27363, 27372, 27383, 27393]
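The gap between the two lists comes from multi-byte characters. As a small illustration (using the last line of the test file as sample text, which is the only part of that file shown here), the character count and the UTF-8 byte count of the same string can be compared directly:

```python
text = "S-érték=9,59"

n_chars = len(text)                  # number of unicode characters
n_bytes = len(text.encode("utf-8"))  # number of bytes once encoded as UTF-8

# Each "é" takes two bytes in UTF-8, so the byte count is larger by two.
print(n_chars, n_bytes)  # 12 14
```

Every such character before a given line pushes its byte index one further ahead of its character index.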

Now I want the content of the last line:

>>> u[24368:]
'S-érték=9,59'
>>> b[27393:]
b'S-\xc3\xa9rt\xc3\xa9k=9,59'

If, however, you want to use seek() before read(), you have to stick to the byte indices, even when the file is opened in text mode:

>>> with open("test.dat", encoding="utf-8") as f:
...     f.seek(27393)
...     f.read()
...
27393
'S-érték=9,59'
>>> with open("test.dat", "rb") as f:
...     f.seek(27393)
...     f.read()
...
27393
b'S-\xc3\xa9rt\xc3\xa9k=9,59'

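Using 24368 in the first case would be a terrible mistake here: seek() would treat the character index as a byte offset and land well before the start of the last line.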
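Putting the pieces together, a line can be fetched by number by seeking to its byte offset and reading just that line. This is a rough sketch under a few assumptions: byte_offsets and read_line are hypothetical helper names, the file is UTF-8, and the three-line sample file is made up for the demo.

```python
import os
import tempfile

def byte_offsets(fileobj):
    """Yield the starting byte offset of each line of a binary file object."""
    pos = 0
    for line in fileobj:
        yield pos
        pos += len(line)  # len() of a bytes object counts bytes

def read_line(path, offsets, n, encoding="utf-8"):
    """Return line n (0-based) by seeking straight to its byte offset."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode(encoding).rstrip("\r\n")

# Hypothetical sample file with accented characters
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("első\nmásodik\nS-érték=9,59\n".encode("utf-8"))

with open(tmp.name, "rb") as f:
    offsets = list(byte_offsets(f))

last_line = read_line(tmp.name, offsets, 2)
os.remove(tmp.name)
print(last_line)
```

Opening in binary mode and decoding afterwards keeps the offsets and the seek() calls in the same unit, which is the point the examples above make.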

Note that if you read() the whole file into a string or bytestring and want to work with individual lines afterwards, it is wiser to use .splitlines().
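A short illustration of .splitlines() on an in-memory string; the sample content is made up, and keepends=True is shown because it preserves the line endings in case the lengths are still needed for index arithmetic:

```python
content = "first\nsecond\r\nthird"

lines = content.splitlines()              # line endings stripped
kept = content.splitlines(keepends=True)  # line endings preserved

print(lines)  # ['first', 'second', 'third']
print(kept)   # ['first\n', 'second\r\n', 'third']
```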

Hope this helped!

SzieberthAdam answered Nov 20 '22 06:11