I am using Python to do some analysis on certain datasets, and this process generates huge lists/dictionaries that consume up to 30% (as reported by top) of RAM (24 GB). There are ~400 such data files and each has to be processed, so I cannot run more than two jobs at a time (otherwise my system hangs). Analysing each file takes a few minutes, and the entire dataset takes close to two days.
The only solution is to use parallel processing, and to implement it I need to create functions that will execute the tasks.
The first step remains the same: open the file, read it, split the contents, and store them as a list. Usually I do the analysis on that list, produce another list, and then delete the previous one to save memory. However, if I use multiprocessing I would have to pass this list as an argument to some function.
Is making the list global a possible way? Example:
# OPEN FILE, read it all, and split the contents into records on '%' #
from re import findall

f = open(args.infile, 'r')
a = f.read()
f.close()
mall = findall('[^%]+', a)
del a

# split the record list into one chunk per core
lm = len(mall)
size = (lm + args.numcores - 1) // args.numcores   # chunk size, rounded up
m = []
for i in range(args.numcores):
    if i < args.numcores - 1:
        m.append(mall[i*size:(i+1)*size])
    else:
        m.append(mall[i*size:lm])
del mall
and then pass each chunk to a function fun(<list>): in this case, one call fun(m[i]) per process.
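If that layout works, the chunks could be handed to worker processes with multiprocessing.Pool. A minimal sketch, assuming m and args.numcores come from the snippet above; the body of fun here is only a placeholder for the real per-chunk analysis:

from multiprocessing import Pool

def fun(chunk):
    # placeholder analysis: replace with the real per-chunk work
    return len(chunk)

if __name__ == '__main__':
    # Pool.map pickles each chunk and copies it into a worker process,
    # so each worker holds its own copy in addition to the parent's list m
    with Pool(processes=args.numcores) as pool:
        results = pool.map(fun, m)

pool.map returns the per-chunk results in the order of m, so they can be combined afterwards in the parent process.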
No, there's no copy made of the object. Parameters passed to a function reference the same object as the caller.
Deleting the variable within the function won't help, since there's still a reference at the calling site. Garbage collection won't occur until all references are gone.
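A quick way to see this within a single process (plain CPython; the names here are purely illustrative):

big = list(range(10**7))      # stand-in for the large record list

def fun(data):
    print(data is big)        # True: the parameter is just another name for the same list
    del data                  # removes only the local name; the caller's reference remains

fun(big)
# big is still alive and fully populated here; the memory is freed only
# once every reference is gone, e.g. after del big in the caller as well
del big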