I am trying to reduce the memory requirements of my python 3 code. Right now each iteration of the for loop requires more memory than the last one.
I wrote a small piece of code that has the same behaviour as my project:
import numpy as np
from multiprocessing import Pool
from itertools import repeat

def simulation(steps, y):  # the function that starts the parallel execution of f()
    pool = Pool(processes=8, maxtasksperchild=int(steps/8))
    results = pool.starmap(f, zip(range(steps), repeat(y)), chunksize=int(steps/8))
    pool.close()
    return results

def f(steps, y):  # steps is used as a counter. My code doesn't need it.
    a, b = np.random.random(2)
    return y*a, y*b

def main():
    steps = 2**20  # amount of times a random sample is taken
    y = np.ones(5)  # dummy variable to show that the next iteration of the code depends on the previous one
    total_results = np.zeros((0, 2))
    for i in range(5):
        results = simulation(steps, y[i-1])
        y[i] = results[0][0]
        total_results = np.vstack((total_results, results))
    print(total_results, y)

if __name__ == "__main__":
    main()
For each iteration of the for loop the processes in simulation() each have a memory usage equal to the total memory used by my code.
Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?
Ideally I would want my code to only copy the memory it requires to execute f() while I can save the results in memory.
Though the script does use quite a bit of memory even with the "smaller" example values, the answer to

Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?

is that it does, in a way, clone the environment by forking a new process, but if copy-on-write semantics are available, no actual physical memory needs to be copied until it is written to. For example, on this system
% uname -a
Linux mypc 4.2.0-27-generic #32-Ubuntu SMP Fri Jan 22 04:49:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
COW seems to be available and in use, but this may not be the case on other systems. On Windows the situation is strictly different, as a new Python interpreter is executed from the .exe instead of being forked. Since you mention using htop, you're on some flavour of UNIX or a UNIX-like system, and you get COW semantics.
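If you want to check which behaviour you are getting, multiprocessing can report the start method in use; a minimal check (assuming Python 3.4+, where get_start_method() is available):

import multiprocessing as mp

# 'fork' (COW applies) is the default on Linux; Windows uses 'spawn', which starts
# a fresh interpreter and transfers the needed state instead of sharing memory pages.
print(mp.get_start_method())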
For each iteration of the for loop the processes in simulation() each have a memory usage equal to the total memory used by my code.
The spawned processes will display almost identical values of RSS, but this can be misleading: for the most part they occupy the same physical memory, mapped into multiple processes, as long as no writes occur. With Pool.map the story is a bit more complicated, since it "chops the iterable into a number of chunks which it submits to the process pool as separate tasks". This submitting happens over IPC, and the submitted data will be copied. In your example the IPC and the 2**20 function calls also dominate the CPU usage. Replacing the mapping with a single vectorized multiplication in simulation took the script's runtime from around 150 s down to 0.66 s on this machine.
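For illustration, a minimal sketch of what such a vectorized simulation() could look like (an assumption about the intended semantics, not the exact code timed above):

def simulation(steps, y):
    # Draw all samples in one call and scale them with a single vectorized
    # multiplication, avoiding the IPC and the 2**20 separate calls to f().
    return np.random.random((steps, 2)) * y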
We can observe COW with a (somewhat) simplified example that allocates a large array and passes it to a spawned process for read-only processing:
import numpy as np
from multiprocessing import Process, Condition, Event
from time import sleep
import psutil

def read_arr(arr, done, stop):
    with done:
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)

def main():
    # Create a large array
    print('Available before A (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    A = np.random.random(2**28)
    print('Available before Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    done = Condition()
    stop = Event()
    p = Process(target=read_arr, args=(A, done, stop))
    with done:
        p.start()
        done.wait()
    print('Available with Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    stop.set()
    p.join()

if __name__ == '__main__':
    main()
Output on this machine:
% python3 test.py
Available before A (MiB): 7779.25
Press Enter...
Available before Process (MiB): 5726.125
Press Enter...
134221579.355
Available with Process (MiB): 5720.79296875
Press Enter...
Now if we replace the function read_arr with a function that modifies the array:
def mutate_arr(arr, done, stop):
    with done:
        arr[::4096] = 1
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)
the results are quite different:
Available before A (MiB): 7626.12109375
Press Enter...
Available before Process (MiB): 5571.82421875
Press Enter...
134247509.654
Available with Process (MiB): 3518.453125
Press Enter...
The for loop does indeed require more memory after each iteration, but that's expected: it stacks the total_results from the mapping, so it has to allocate space for a new array that holds both the old and the new results, and then free the now-unused array of old results.
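If that repeated reallocation becomes a concern, a common pattern is to collect each iteration's results in a Python list and stack them once at the end. A minimal sketch reusing the question's simulation() (a hypothetical replacement for the loop, not the author's code):

steps = 2**20
y = np.ones(5)
chunks = []
for i in range(5):
    results = simulation(steps, y[i-1])
    y[i] = results[0][0]
    chunks.append(np.asarray(results))  # keep a reference; no copying of old results
total_results = np.vstack(chunks)       # a single allocation at the end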