
Python how to read N number of lines at a time

I am writing code to read an enormous text file (several GB) N lines at a time, process that batch, and move on to the next N lines until I have completed the entire file. (I don't care if the last batch isn't the perfect size.)

I have been reading about using itertools.islice for this operation. I think I am halfway there:

from itertools import islice

N = 16
infile = open("my_very_large_text_file", "r")
lines_gen = islice(infile, N)

for lines in lines_gen:
    ...process my lines...

The trouble is that I would like to process the next batch of 16 lines, but I am missing something.

asked Jun 13 '11 by brokentypewriter

People also ask

How do you read multiple lines of text in Python?

The built-in readline() method returns one line at a time; to read multiple lines, call readline() multiple times.

How do I count the number of lines in a file in Python?

Use readlines() to get a line count. This is the most straightforward way to count the number of lines in a text file in Python: the readlines() method reads all lines from a file and stores them in a list, and len() on that list gives the total number of lines in the file.
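Note that for a multi-GB file like the one in the question, readlines() loads every line into memory at once. A constant-memory alternative (an addition, not from the page) is to iterate over the file object and count as you go:

with open("my_very_large_text_file") as f:  # filename taken from the question
    line_count = sum(1 for _ in f)  # consumes the file one line at a time
print(line_count)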


2 Answers

islice() can be used to get the next n items of an iterator. Thus, list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines. At the end of the file, the list might be shorter, and finally the call will return an empty list.

from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines
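If this loop recurs in several places, it can be wrapped in a small generator. This is a sketch, not part of the original answer; the name batched_lines and the batch size are illustrative:

from itertools import islice

def batched_lines(f, n):
    # Yield lists of up to n lines from the open file f;
    # the final list may be shorter than n.
    while True:
        batch = list(islice(f, n))
        if not batch:
            return
        yield batch

with open("my_very_large_text_file") as f:
    for batch in batched_lines(f, 16):
        pass  # process the batch here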

An alternative is to use the grouper pattern with izip_longest (spelled zip_longest in Python 3):

from itertools import izip_longest  # itertools.zip_longest in Python 3

with open(...) as f:
    for next_n_lines in izip_longest(*[f] * n):
        # process next_n_lines
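One caveat worth adding (not in the original answer): izip_longest pads the final group with a fill value, None by default, when the line count is not a multiple of n, so the last batch may need filtering. A Python 3 sketch using the zip_longest spelling:

from itertools import zip_longest

with open("my_very_large_text_file") as f:  # filename taken from the question
    for group in zip_longest(*[f] * 16):
        # the last group is padded with None up to length 16
        lines = [line for line in group if line is not None]
        # process lines here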
answered by Sven Marnach


The question appears to presume that there is efficiency to be gained by reading an "enormous text file" in blocks of N lines at a time. This adds an application layer of buffering over the already highly optimized stdio library, adds complexity, and probably buys you absolutely nothing.

Thus:

with open('my_very_large_text_file') as f:
    for line in f:
        process(line)

is probably superior to any alternative in time, space, complexity and readability.

See also Rob Pike's first two rules, Jackson's Two Rules, and PEP 20, The Zen of Python. If you really just wanted to play with islice, you should have left out the large-file stuff.

answered by msw