Let's say I have only 8 GB of memory available and I would like to break a file bigger than that into a series of smaller files. If I try
with open(fname) as f:
    content = f.readlines()
I will run out of memory because it attempts to load the whole file. Is there a way to open the file without loading the whole thing in memory and just take lines X through Y?
itertools.islice
is a good tool for the job, but you need to consider how to use it efficiently. For instance, islice(f, 10, 20)
discards the first 10 lines and then emits the next ten, so it isn't a good way to structure the writes. Depending on how you arrange your loop, you either drop data or rescan the file for each write.
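That said, for a single one-off slice it does answer the "lines X through Y" part of the question directly. A minimal sketch (the filename and the 10-20 range are just placeholders):

import itertools

# read only lines 10 through 19 (0-based), keeping one line in memory at a time
with open('big.txt') as f:
    for line in itertools.islice(f, 10, 20):
        print(line, end='')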
It's also not obvious when you are done. fileobj.writelines(itertools.islice(f, 10))
will happily keep writing zero-line files once the input is exhausted. You really only know you are done after the fact, so you can test whether you just wrote a zero-length file and terminate there.
In this example my big file is 100 lines long and I break it into 10 lines apiece... that's a bit quicker to test than an 8 GB file.
import itertools
import os

lines_per_file = 10

with open('big.txt') as infp:
    # file counter used to create unique output files
    for file_count in itertools.count(1):
        out_filename = 'out-{}.txt'.format(file_count)
        with open(out_filename, 'w') as outfp:
            # write configured number of lines to file
            outfp.writelines(itertools.islice(infp, lines_per_file))
        # break when no extra data written
        if os.stat(out_filename).st_size == 0:
            os.remove(out_filename)
            break
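If you would rather not create and then delete an empty file, a variation (my own sketch, not part of the answer above) is to peek at the next input line before opening each output file:

import itertools

lines_per_file = 10

with open('big.txt') as infp:
    for file_count in itertools.count(1):
        # pull one line first; if the input is exhausted, stop before creating a file
        first_line = next(infp, None)
        if first_line is None:
            break
        with open('out-{}.txt'.format(file_count), 'w') as outfp:
            outfp.write(first_line)
            # write up to lines_per_file - 1 more lines for this chunk
            outfp.writelines(itertools.islice(infp, lines_per_file - 1))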