In my program I need to share a dictionary between processes with Python's multiprocessing module. I've simplified the code into an example here:
import multiprocessing

def folding(return_dict, seq):
    dis = 1
    d = 0
    ddg = 1
    '''This is irrelevant; actually my program sends the seq parameter to an
    external program that returns the dis, d and ddg parameters.'''
    return_dict[seq] = [dis, d, ddg]

seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
manager = multiprocessing.Manager()
return_dict = manager.dict()
n_cores = 3  # the number of cores available on the computer, defined by the user

for i in range(0, len(seqs), n_cores):
    subseqs = seqs[i:i + n_cores]
    processes = [multiprocessing.Process(target=folding, args=(return_dict, seq))
                 for seq in subseqs]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

for i in return_dict:
    print(i)
I expected to have, at the end of the program, the return_dict with all of the property values. When I run my program, which has to do this with a seqs list of thousands of sequences and repeat it many times, sometimes I get the correct result, but most of the time the program never ends, yet returns no error, and I have no idea what is going wrong. Furthermore, I think this is not very efficient in time, and I would like to know if there is another way to do this more efficiently and faster. Thank you, everybody!
After fixing some minor syntax errors, your code seems to work.

However, I would use a multiprocessing pool instead of your custom solution, so that n_cores processes are always running at a time. The problem with your approach is that all processes in a batch need to finish before you can start the next batch. Depending on how variable the time needed to compute folding is, you can run into a bottleneck. In the worst case, this means no speedup whatsoever compared to single-core processing.
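To make the difference concrete, here is a minimal sketch (the sleep-based slow_folding stand-in and its made-up timings are purely for illustration) contrasting the batch approach with a pool: with batches, one slow task stalls all cores until the whole batch is done, while the pool hands the next sequence to whichever worker becomes free:

import multiprocessing as mp
import time

def slow_folding(seq):
    # Hypothetical stand-in: pretend longer sequences take longer to fold
    time.sleep(0.1 * len(seq))

if __name__ == '__main__':
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
    n_cores = 3

    # Batch approach: every batch waits for its slowest member
    start = time.time()
    for i in range(0, len(seqs), n_cores):
        procs = [mp.Process(target=slow_folding, args=(s,))
                 for s in seqs[i:i + n_cores]]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
    print('batches:', time.time() - start)

    # Pool approach: a free worker picks up the next task immediately
    start = time.time()
    with mp.Pool(n_cores) as pool:
        pool.map(slow_folding, seqs)
    print('pool:', time.time() - start)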
Moreover, your program will run into serious problems under Windows. You need to make sure that your main module can be safely imported without re-running your multiprocessing code. That is, you need to protect your main entry point via if __name__ == '__main__':, which you might have seen already in other Python scripts. This makes sure that the protected code is only executed when your script is started as the main file by the interpreter, i.e. only once, and not by every new child process you spawn.
Here is my slightly changed version of your code using a pool:
import multiprocessing as mp

def folding(return_dict, seq):
    dis = 1
    d = 0
    ddg = 1
    '''This is irrelevant; actually my program sends the seq parameter to an
    external program that returns the dis, d and ddg parameters.'''
    return_dict[seq] = [dis, d, ddg]

def main():
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
    manager = mp.Manager()
    return_dict = manager.dict()
    n_cores = 3

    # Create a pool running at most n_cores worker processes at a time
    pool = mp.Pool(n_cores)

    # Execute the folding task in parallel
    for seq in seqs:
        pool.apply_async(folding, args=(return_dict, seq))

    # Tell the pool that there are no more tasks to come, then join
    pool.close()
    pool.join()

    # Print the results
    for i in return_dict.keys():
        print(i, return_dict[i])

if __name__ == '__main__':
    # Protected main entry point
    main()
and this will print
atcgtg [1, 0, 1]
atcgatcgatc [1, 0, 1]
agcgatcg [1, 0, 1]
atcggatcg [1, 0, 1]
agctgctagct [1, 0, 1]
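One caveat worth knowing about apply_async (and a plausible explanation for silent failures like the one described in the question): exceptions raised inside a worker are not printed anywhere; they are stored on the AsyncResult object that apply_async returns and only re-raised when you call .get() on it. A minimal sketch of keeping those handles so errors actually surface (the 60-second timeout is an arbitrary value for illustration):

import multiprocessing as mp

def folding(return_dict, seq):
    return_dict[seq] = [1, 0, 1]  # placeholder for the real computation

def main():
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc']
    manager = mp.Manager()
    return_dict = manager.dict()
    with mp.Pool(3) as pool:
        # Keep the AsyncResult handles instead of discarding them
        handles = [pool.apply_async(folding, args=(return_dict, seq))
                   for seq in seqs]
        for h in handles:
            h.get(timeout=60)  # re-raises any exception from the worker here
    print(dict(return_dict))

if __name__ == '__main__':
    main()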
EDIT: Also, in your case there is actually no need for a shared data structure. You can simply rely on the pool's map function. map takes an iterable, which is then used to call the function folding once with each element of the iterable. Using map over map_async has the advantage that the results are in the order of the inputs, but you have to wait until all results are gathered before you can process them.

Here is an example using map. Note that your function folding now returns the result instead of putting it into a shared dictionary:
import multiprocessing as mp

def folding(seq):
    dis = 1
    d = 0
    ddg = 1
    '''This is irrelevant; actually my program sends the seq parameter to an
    external program that returns the dis, d and ddg parameters.'''
    # Return the result instead of using shared data
    return [dis, d, ddg]

def main():
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
    n_cores = 3
    pool = mp.Pool(n_cores)

    # Use map, which blocks until all results are ready:
    res = pool.map(folding, iterable=seqs)
    pool.close()
    pool.join()

    # Print the results; list(zip(...)) pairs each input with its output
    print(list(zip(seqs, res)))

if __name__ == '__main__':
    main()
and this one prints
[('atcgtg', [1, 0, 1]), ('agcgatcg', [1, 0, 1]), ('atcgatcgatc', [1, 0, 1]), ('atcggatcg', [1, 0, 1]), ('agctgctagct', [1, 0, 1])]
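If you do want to process results as soon as they are ready instead of waiting for map to finish, pool.imap_unordered yields results in completion order. A minimal sketch (note the function then has to return the input alongside the result, since you no longer know which sequence each output belongs to):

import multiprocessing as mp

def folding(seq):
    # Return the input together with the result, because completion order is arbitrary
    return seq, [1, 0, 1]

def main():
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
    with mp.Pool(3) as pool:
        for seq, res in pool.imap_unordered(folding, seqs):
            print(seq, res)  # printed as each worker finishes, not in input order

if __name__ == '__main__':
    main()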