I am writing code to take an enormous text file (several GB), N lines at a time, process that batch, and move on to the next N lines until I have completed the entire file. (I don't care if the last batch isn't the perfect size.)
I have been reading about using itertools.islice for this operation. I think I am halfway there:
from itertools import islice

N = 16
infile = open("my_very_large_text_file", "r")
lines_gen = islice(infile, N)

for lines in lines_gen:
    ...process my lines...
The trouble is that I would like to process the next batch of 16 lines, but I am missing something.
islice() can be used to get the next n items of an iterator. Thus, list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines. At the end of the file, the list might be shorter, and finally the call will return an empty list.
from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines
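The same loop can also be written without the explicit break by using the two-argument form of iter(), which calls a callable repeatedly until it returns a sentinel value. A minimal sketch, reusing the file name and batch size from the question:

from itertools import islice

n = 16

# iter(callable, sentinel) keeps calling the callable until it returns the
# sentinel; list(islice(f, n)) returns [] once the file is exhausted, so the
# loop stops exactly at end of file.
with open('my_very_large_text_file') as f:
    for next_n_lines in iter(lambda: list(islice(f, n)), []):
        pass  # process next_n_lines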
An alternative is to use the grouper pattern:
from itertools import izip_longest  # Python 2; renamed to zip_longest in Python 3

with open(...) as f:
    for next_n_lines in izip_longest(*[f] * n):
        pass  # process next_n_lines
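Note that izip_longest exists only in Python 2; Python 3 renamed it to itertools.zip_longest, and the final short batch is padded with the fillvalue (None by default), which you usually want to strip before processing. A sketch of the Python 3 equivalent, again borrowing the question's file name and batch size:

from itertools import zip_longest

n = 16
with open('my_very_large_text_file') as f:
    for batch in zip_longest(*[f] * n):
        # zip_longest pads the last, short batch with None; drop the padding
        next_n_lines = [line for line in batch if line is not None]
        # process next_n_lines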
The question appears to presume that there is efficiency to be gained by reading an "enormous textfile" in blocks of N lines at a time. This adds an application layer of buffering over the already highly optimized stdio library, adds complexity, and probably buys you absolutely nothing.
Thus:
with open('my_very_large_text_file') as f:
    for line in f:
        process(line)
is probably superior to any alternative in time, space, complexity and readability.
See also Rob Pike's first two rules, Jackson's Two Rules, and PEP 20, The Zen of Python. If you really just wanted to play with islice, you should have left out the large-file stuff.