
Memory-efficient way to iterate over part of a large file

I normally avoid reading files like this:

with open(file) as f:
    list_of_lines = f.readlines()

and use this type of code instead.

f = open(file)
for line in f:
    pass  # do something with line

The exception is when I only have to iterate over a few lines in a file (and I know which lines those are); then I think it is easier to take slices of list_of_lines. Now this has come back to bite me. I have a HUGE file (reading it into memory is not possible), but I don't need to iterate over all of the lines, just a few of them. I already have code that finds where my first line is and how many lines after that I need to edit. I just don't have any idea how to write this loop.

n = #grep for number of lines 
start = #pattern match the start line 
f = open('big_file')
# some loop over f from start to start + n
    # edit lines

EDIT: my title may have led to a debate rather than an answer.

Ajay asked Jun 19 '14 16:06


1 Answer

If I understand your question correctly, the problem you're encountering is that storing all the lines of text in a list and then taking a slice uses too much memory. What you want is to read the file line-by-line, while ignoring all but a certain set of lines (say, lines [17,34) for example).

Try using enumerate to keep track of which line number you're on as you iterate through the file. Here is a generator-based approach which uses yield to output the interesting lines only one at a time:

def read_only_lines(f, start, finish):
    # Yield lines whose zero-based index is in [start, finish), then stop reading.
    for ii, line in enumerate(f):
        if start <= ii < finish:
            yield line
        elif ii >= finish:
            return

with open("big text file.txt", "r") as f:
    for line in read_only_lines(f, 17, 34):
        print(line.rstrip("\n"))

This read_only_lines function basically reimplements itertools.islice from the standard library, so you could use that to make an even more compact implementation:

from itertools import islice

with open("big text file.txt", "r") as f:
    for line in islice(f, 17, 34):
        print(line.rstrip("\n"))
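Applied to the question, where the starting line and the number of lines are already known, the slice bounds would be start and start + n. Here is a minimal sketch, assuming start is a zero-based line index; the helper name lines_in_range is hypothetical and not something from the question:

```python
from itertools import islice


def lines_in_range(path, start, n):
    """Yield up to n lines from the file at path, beginning at zero-based index start.

    Hypothetical helper for illustration; `start` and `n` stand in for the
    values the question's own code already computes.
    """
    with open(path) as f:
        # islice reads and discards the first `start` lines one at a time,
        # so only a single line is held in memory at any moment.
        for line in islice(f, start, start + n):
            yield line
```

Note that islice still has to read past the first start lines to find the right place, so this saves memory rather than the time spent skipping ahead.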

If you want to capture the lines of interest in a list rather than a generator, just pass the islice object to list():

from itertools import islice

with open("big text file.txt", "r") as f:
    lines_of_interest = list(islice(f, 17, 34))

do_something_awesome(lines_of_interest)
do_something_else(lines_of_interest)
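Since the question talks about editing the lines rather than just reading them, one possible approach is to stream the file into a second file and change only the lines in the range. This is a rough sketch under my own assumptions: edit_line_range, dst_path and the edit callback are hypothetical names, and start is taken to be a zero-based line index.

```python
def edit_line_range(src_path, dst_path, start, n, edit):
    """Copy src_path to dst_path, applying edit() to n lines starting at index start.

    Hypothetical helper for illustration. `edit` is any function that takes
    one line (including its trailing newline) and returns the replacement text.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for i, line in enumerate(src):
            if start <= i < start + n:
                line = edit(line)  # only lines inside the range are changed
            dst.write(line)  # every line is copied, one at a time
```

For example, edit_line_range("big_file", "big_file.new", start, n, str.upper) would upper-case just those n lines. Writing to a separate file means the whole file is copied once, but memory use stays at one line regardless of the file size.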
Dan Lenski answered Oct 27 '22 00:10