How to split file of size bigger than memory available?

Let's say I have only 8G of heap space available and I would like to butcher up a file bigger than that into a series of smaller files. If I try

with open(fname) as f:
    content = f.readlines()

I will run out of memory because it attempts to load the whole file. Is there a way to open the file without loading the whole thing in memory and just take lines X through Y?
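
For context, one lazy way to pull just lines X through Y looks like the sketch below (illustrative only; the 0-based X/Y values and the process() handler are placeholders, and fname is the same variable as above):

import itertools

# Minimal sketch: iterate lines X through Y (0-based, Y exclusive)
# without loading the whole file into memory.
X, Y = 1000, 2000              # illustrative values
with open(fname) as f:
    for line in itertools.islice(f, X, Y):
        process(line)          # hypothetical per-line handler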

asked Apr 10 '17 by amphibient

1 Answer

itertools.islice is a good tool for the job, but you need to consider how to use it efficiently. For instance, islice(f, 10, 20) discards the first 10 lines and then emits only the next ten, so it isn't a good way to do the writes. Depending on how you write your loop, you either drop data or rescan the file for each write.
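
To see that pitfall concretely, here is a small sketch (the numbered test file is hypothetical): calling islice(f, 10, 20) repeatedly on the same file object silently throws away ten lines on every pass.

import itertools

with open('big.txt') as f:                       # hypothetical file of numbered lines
    first = list(itertools.islice(f, 10, 20))    # skips lines 0-9, yields lines 10-19
    second = list(itertools.islice(f, 10, 20))   # skips lines 20-29, yields lines 30-39
# Lines 0-9 and 20-29 were read and discarded, so a naive loop drops data.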

It's also not obvious when you are done. fileobj.writelines(itertools.islice(f, 10)) will happily keep writing zero-line files until the end of time. You only really know you are done after the fact, so one way to terminate is to check whether the file you just wrote is zero-length.

In this example my big file is 100 lines long and I break it into files of 10 lines apiece... that's a bit quicker to test than an 8 GB file.

import itertools
import os

lines_per_file = 10

with open('big.txt') as infp:
    # file counter used to create unique output files
    for file_count in itertools.count(1):
        out_filename = 'out-{}.txt'.format(file_count)
        with open(out_filename, 'w') as outfp:
            # write configured number of lines to file
            outfp.writelines(itertools.islice(infp, lines_per_file))
        # break when no extra data written
        if os.stat(out_filename).st_size == 0:
            os.remove(out_filename)
            break
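
If you want to try the snippet above, here is a quick sketch for creating a matching 100-line test file (the line contents are arbitrary):

# Generate a 100-line big.txt so the splitter above can be exercised.
with open('big.txt', 'w') as f:
    for i in range(100):
        f.write('line {}\n'.format(i))
# The splitter then leaves out-1.txt .. out-10.txt with 10 lines each;
# the empty out-11.txt is created, detected, and removed.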
answered Sep 21 '22 by tdelaney