Memory use when manipulating/processing large data structures

I have a number of large (~100 MB) files that I process regularly. Although I try to delete unneeded data structures during processing, memory consumption is still too high. I was wondering whether there is a way to manipulate large data efficiently, e.g.:

def read(self, filename):
    fc = read_100_mb_file(filename)  # read_100_mb_file stands in for the actual loader
    self.process(fc)

def process(self, content):
    # do some processing of file content
    ...

Is the data duplicated when it is passed around like this? Wouldn't it be more memory-efficient to store it in an instance attribute such as self.fc instead?

When should I trigger garbage collection? I know about the gc module, but do I call gc.collect() after I del fc, for example?
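
In other words, is the intended pattern something like this (just a sketch of what I mean; read_100_mb_file is still a placeholder for the real loader)?

import gc

def read(self, filename):
    fc = read_100_mb_file(filename)
    self.process(fc)
    del fc        # explicitly drop the reference to the 100 MB structure...
    gc.collect()  # ...and then force a collection pass right away?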

update

p.s. 100 MB is not a problem in itself, but the float conversion and further processing add significantly more to both the working set and the virtual size (I'm on Windows).

asked by SilentGhost


2 Answers

I'd suggest looking at David Beazley's presentation on using generators in Python. This technique lets you handle a lot of data and do complex processing quickly, without blowing up your memory use. IMO, the trick isn't holding a huge amount of data in memory as efficiently as possible; the trick is avoiding loading a huge amount of data into memory at the same time.
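
For example, here is a minimal sketch of that idea (the file name and the whitespace-separated-floats format are only assumptions for illustration): a generator yields one parsed record at a time, and the processing step consumes the stream without ever holding the whole file in memory.

def read_records(filename):
    # Yield one parsed record at a time instead of loading the whole file at once.
    with open(filename) as f:
        for line in f:
            # Only one line's worth of floats is alive at any given moment.
            yield [float(x) for x in line.split()]

def process(records):
    total = 0.0
    for values in records:
        total += sum(values)  # stand-in for the real per-record processing
    return total

result = process(read_records("huge_file.txt"))  # hypothetical file name

Because the pipeline is lazy, peak memory stays at roughly one record's worth regardless of file size, and further processing stages can be chained as additional generators.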

answered by Ryan Ginstrom


Before you start tearing your hair out over the garbage collector, you might be able to avoid that 100 MB hit of loading the entire file into memory by using a memory-mapped file object. See the mmap module.
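
A minimal sketch of that approach (the file name is hypothetical; adapt the per-line handling to your format):

import mmap

with open("huge_file.txt", "rb") as f:  # hypothetical file name
    # Map the file into the address space; the OS pages the data in on demand
    # instead of copying the whole file into a Python object up front.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        for line in iter(mm.readline, b""):
            pass  # process each line (a bytes object) here
    finally:
        mm.close()

Since the operating system handles the paging, the whole file never has to sit in your process's heap at once.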

answered by Crashworks