Generator using item[n-1] + item[n] memory

I have a Python 3.x program that processes several large text files containing sizeable arrays of data, which can occasionally brush up against the memory limit of my puny workstation. Basic memory profiling suggests that, when iterating over the generator, my script holds two consecutive elements in memory at once, using up to twice the memory I expect.

I made a simple, stand-alone example to test the generator, and I get similar results in Python 2.7, 3.3, and 3.4. My test code follows; memory_usage() is a modified version of this function from an SO question, which reads /proc/self/status and agrees with top as I watch it. resource is probably a more cross-platform method:

import sys, resource, gc, time

def biggen():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        data = [1] * int(size * 1e6)
        #time.sleep(1)
        yield data

def consumer():
    for data in biggen():
        rusage = resource.getrusage(resource.RUSAGE_SELF)
        peak_mb = rusage.ru_maxrss/1024.0
        print('Peak: {0:6.1f} MB, Data Len: {1:6.1f} M'.format(
                peak_mb, len(data)/1e6))
        #print(memory_usage())

        data = None  # go
        del data     # away
        gc.collect() # please.

# def memory_usage():
#     """Memory usage of the current process, requires /proc/self/status"""
#     # https://stackoverflow.com/a/898406/194586
#     result = {'peak': 0, 'rss': 0}
#     for line in open('/proc/self/status'):
#         parts = line.split()
#         key = parts[0][2:-1].lower()
#         if key in result:
#             result[key] = int(parts[1])/1024.0
#     return 'Peak: {peak:6.1f} MB, Current: {rss:6.1f} MB'.format(**result)

print(sys.version)
consumer()

In practice I'll process data coming from such a generator loop, saving just what I need, then discard it.

When I run the above script, and two large elements come in series (the data size can be highly variable), it seems like Python computes the next before freeing the previous, leading to up to double the memory usage.

$ python genmem.py 
2.7.3 (default, Sep 26 2013, 20:08:41) 
[GCC 4.6.3]
Peak:    7.9 MB, Data Len:    1.0 M
Peak:   11.5 MB, Data Len:    1.0 M
Peak:   45.8 MB, Data Len:   10.0 M
Peak:   45.9 MB, Data Len:    1.0 M
Peak:   45.9 MB, Data Len:    1.0 M
Peak:   45.9 MB, Data Len:   10.0 M
#        ^^  not much different versus previous 10M-list
Peak:   80.2 MB, Data Len:   10.0 M
#        ^^  same list size, but new memory peak at roughly twice the usage
Peak:   80.2 MB, Data Len:    1.0 M
Peak:   80.2 MB, Data Len:    1.0 M
Peak:   80.2 MB, Data Len:   10.0 M
Peak:   80.2 MB, Data Len:   10.0 M
Peak:  118.3 MB, Data Len:   20.0 M
#        ^^  and again...  (20+10)*x
Peak:  118.3 MB, Data Len:    1.0 M
Peak:  118.3 MB, Data Len:    1.0 M
Peak:  118.3 MB, Data Len:   20.0 M
Peak:  156.5 MB, Data Len:   20.0 M
#        ^^  and again. (20+20)*x
Peak:  156.5 MB, Data Len:    1.0 M
Peak:  156.5 MB, Data Len:    1.0 M

The crazy belt-and-suspenders-and-duct-tape approach of data = None, del data, and gc.collect() does nothing.

I'm pretty sure the generator itself is not doubling up on memory: otherwise a single large yielded value would raise the peak on its own, in the same iteration the large object appeared. The peak only jumps when two large objects come consecutively.
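A quick sanity check with sys.getrefcount (a small sketch with tiny lists standing in for the real data; the refcounts are CPython-specific) shows that the suspended generator frame still holds a reference to the list it just yielded:

```python
import sys

def biggen_small():
    # tiny stand-in for biggen() above
    for size in (3, 5):
        data = [1] * size
        yield data

g = biggen_small()
chunk = next(g)
plain = [1] * 3

# getrefcount counts its own argument too; subtracting a plain list's
# count isolates the extra reference held by the generator's frame.
print(sys.getrefcount(chunk) - sys.getrefcount(plain))  # 1 (CPython)
```

So even after the consumer drops its reference, the generator's local data keeps the old list alive until the next iteration rebinds it.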

How can I save my memory?

Nick T asked Feb 14 '14
1 Answer

The problem is in the generator function, particularly in this statement:

    data = [1] * int(size * 1e6)

Suppose data still holds the previous list. When this statement runs, Python first builds the new list, so both lists are in memory at once: old and new. Only then is data rebound to the new list, and the old one released. Try modifying the generator function to:

def biggen():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        data = None
        data = [1] * int(size * 1e6)
        yield data
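An equivalent variant (my sketch, not part of the answer above) drops the generator's reference as soon as it resumes, before the next allocation happens:

```python
def biggen():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        data = [1] * int(size * 1e6)
        yield data
        # once the consumer asks for the next item, release this list
        # before the next `[1] * ...` list is built
        del data
```

Either way, the point is the same: the generator must give up its reference to the old list before allocating the new one, or both live simultaneously at the peak.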
ondra answered Oct 31 '22