I am writing code to take an enormous text file (several GB), N lines at a time, process that batch, and move on to the next N lines until I have completed the entire file. (I don't care if the last batch isn't the perfect size.)
I have been reading about using itertools.islice for this operation. I think I am halfway there:
from itertools import islice

N = 16
infile = open("my_very_large_text_file", "r")
lines_gen = islice(infile, N)

for lines in lines_gen:
    ...process my lines...
The trouble is that I would like to process the next batch of 16 lines, but I am missing something.
islice() can be used to get the next n items of an iterator. Thus, list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines. At the end of the file, the list might be shorter, and finally the call will return an empty list.
from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines
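The same loop can also be written without the explicit break by using the two-argument form of iter(), which calls a callable repeatedly until it returns a sentinel value. A minimal sketch, reusing the file name and batch size from the question:

from itertools import islice

n = 16

# iter(callable, sentinel) keeps calling the callable until it returns the
# sentinel; list(islice(f, n)) returns [] once the file is exhausted, so the
# loop stops exactly at end of file.
with open('my_very_large_text_file') as f:
    for next_n_lines in iter(lambda: list(islice(f, n)), []):
        pass  # process next_n_lines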
An alternative is to use the grouper pattern:
from itertools import izip_longest  # Python 2; renamed to zip_longest in Python 3

with open(...) as f:
    for next_n_lines in izip_longest(*[f] * n):
        pass  # process next_n_lines
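Note that izip_longest exists only in Python 2; Python 3 renamed it to itertools.zip_longest, and the final short batch is padded with the fillvalue (None by default), which you usually want to strip before processing. A sketch of the Python 3 equivalent, again borrowing the question's file name and batch size:

from itertools import zip_longest

n = 16
with open('my_very_large_text_file') as f:
    for batch in zip_longest(*[f] * n):
        # zip_longest pads the last, short batch with None; drop the padding
        next_n_lines = [line for line in batch if line is not None]
        # process next_n_lines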
The question appears to presume that there is efficiency to be gained by reading an "enormous textfile" in blocks of N lines at a time. This adds an application layer of buffering over the already highly optimized stdio library, adds complexity, and probably buys you absolutely nothing.
Thus:
with open('my_very_large_text_file') as f:
    for line in f:
        process(line)
is probably superior to any alternative in time, space, complexity and readability.
See also Rob Pike's first two rules, Jackson's Two Rules, and PEP 20, The Zen of Python. If you really just wanted to play with islice, you should have left out the large-file stuff.