Why does my loop require more memory on each iteration?

I am trying to reduce the memory requirements of my Python 3 code. Right now, each iteration of the for loop requires more memory than the last one.

I wrote a small piece of code that has the same behaviour as my project:

import numpy as np
from multiprocessing import Pool
from itertools import repeat


def simulation(steps, y):  # the function that starts the parallel execution of f()
    pool = Pool(processes=8, maxtasksperchild=int(steps/8))
    results = pool.starmap(f, zip(range(steps), repeat(y)), chunksize=int(steps/8))
    pool.close()
    return results


def f(steps, y):  # steps is used as a counter. My code doesn't need it.
    a, b = np.random.random(2)
    return y*a, y*b

def main():
    steps = 2**20  # amount of times a random sample is taken
    y = np.ones(5)  # dummy variable to show that the next iteration of the code depends on the previous one
    total_results = np.zeros((0,2))
    for i in range(5):
        results = simulation(steps, y[i-1])
        y[i] = results[0][0]
        total_results = np.vstack((total_results, results))

    print(total_results, y)

if __name__ == "__main__":
    main()

For each iteration of the for loop the processes in simulation() each have a memory usage equal to the total memory used by my code.

Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?

Ideally I would want my code to copy only the memory it requires to execute f(), while still letting me save the results in memory.

asked Mar 17 '16 by Isea

1 Answer

Though the script does use quite a bit of memory even with the "smaller" example values, the answer to

Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?

is that it does, in a way, clone the environment when forking a new process, but if copy-on-write (COW) semantics are available, no physical memory actually needs to be copied until it is written to. For example, on this system

 % uname -a 
Linux mypc 4.2.0-27-generic #32-Ubuntu SMP Fri Jan 22 04:49:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

COW seems to be available and in use, but this may not be the case on other systems. On Windows the situation is entirely different, since a new Python interpreter process is spawned from the executable instead of being forked. Since you mention using htop, you're on some flavour of UNIX or a UNIX-like system, and so you get COW semantics.
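
As a quick check (a minimal sketch, not part of the original answer; it assumes Python 3.4+, where multiprocessing.get_start_method() is available), you can ask multiprocessing which start method it will use:

import multiprocessing as mp

if __name__ == "__main__":
    # "fork" on most UNIX-like systems, where COW applies; "spawn" on
    # Windows (and on macOS by default since Python 3.8), which starts a
    # fresh interpreter and pickles the arguments to the child instead.
    print(mp.get_start_method())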

For each iteration of the for loop the processes in simulation() each have a memory usage equal to the total memory used by my code.

The spawned processes will display almost identical RSS values, but this can be misleading: mostly they occupy the same physical memory, mapped into multiple processes, as long as no writes occur. With Pool.map the story is a bit more complicated, since it "chops the iterable into a number of chunks which it submits to the process pool as separate tasks". This submission happens over IPC, and the submitted data is copied. In your example the IPC and the 2**20 function calls also dominate the CPU usage. Replacing the mapping with a single vectorized multiplication in simulation() took the script's runtime from around 150 s down to 0.66 s on this machine.
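
To illustrate what that vectorized replacement might look like (a sketch only, not the original answer's code; it assumes the scalar y and the (steps, 2) result shape from the question):

import numpy as np

def simulation_vectorized(steps, y):
    # Draw all `steps` (a, b) pairs in one NumPy call, replacing 2**20
    # individual f() calls and the IPC needed to ship tasks and results.
    samples = np.random.random((steps, 2))
    return y * samples  # y is a scalar here, so it broadcasts over both columns

The result should have the same (steps, 2) shape the vstacked pool results had, so the rest of main() can stay unchanged.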

We can observe COW with a (somewhat) simplified example that allocates a large array and passes it to a spawned process for read-only processing:

import numpy as np
from multiprocessing import Process, Condition, Event
from time import sleep
import psutil


def read_arr(arr, done, stop):
    with done:
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set(): 
        sleep(1)


def main():
    # Create a large array
    print('Available before A (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    A = np.random.random(2**28)
    print('Available before Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    done = Condition()
    stop = Event()
    p = Process(target=read_arr, args=(A, done, stop))
    with done:
        p.start()
        done.wait()
    print('Available with Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    stop.set()
    p.join()

if __name__ == '__main__':
    main()

Output on this machine:

 % python3 test.py
Available before A (MiB): 7779.25
Press Enter...
Available before Process (MiB): 5726.125
Press Enter...
134221579.355
Available with Process (MiB): 5720.79296875
Press Enter...

Now if we replace the function read_arr with a function that modifies the array:

def mutate_arr(arr, done, stop):
    with done:
        arr[::4096] = 1
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set(): 
        sleep(1)

the results are quite different:

Available before A (MiB): 7626.12109375
Press Enter...
Available before Process (MiB): 5571.82421875
Press Enter...
134247509.654
Available with Process (MiB): 3518.453125
Press Enter...

The for loop does indeed require more memory after each iteration, but that is to be expected: it stacks the total_results from the mapping, so it has to allocate a new array large enough to hold both the old results and the new ones, and then free the now-unused array of old results.
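
If that repeated reallocation bothers you, one alternative (a sketch, not from the original answer; it reuses simulation() and the shapes from the question) is to preallocate total_results once and fill it slice by slice:

def main():
    steps = 2**20
    y = np.ones(5)
    # Preallocate the full result array up front instead of growing it
    # with np.vstack on every iteration.
    total_results = np.empty((5 * steps, 2))
    for i in range(5):
        results = simulation(steps, y[i - 1])  # simulation() as in the question
        y[i] = results[0][0]
        total_results[i * steps:(i + 1) * steps] = results
    print(total_results, y)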

answered Oct 20 '22 by Ilja Everilä