Sharing dictionary in multiprocessing with python

In my program I need to share a dictionary between processes using Python's multiprocessing module. I've simplified the code to give an example here:

from multiprocessing import Manager, Process

def folding(return_dict, seq):
    dis = 1
    d = 0
    ddg = 1
    '''This part is irrelevant; my real program sends the seq parameter
    to an external program that returns the dis, d and ddg values.'''
    return_dict[seq] = [dis, d, ddg]

seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
manager = Manager()
return_dict = manager.dict()
n_cores = 3

for i in range(0, len(seqs), n_cores):  # n_cores is the number of cores available, defined by the user
    subseqs = seqs[i:i + n_cores]
    processes = [Process(target=folding, args=(return_dict, seq)) for seq in subseqs]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

for i in return_dict.keys():
    print(i)

I expected return_dict to contain all of the computed values at the end of the program. When I run my real program, which has to do this with a seqs list of thousands of sequences and repeat it many times, I sometimes get the correct result, but most of the time the program never ends, yet returns no error, and I have no idea what is going wrong. Furthermore, I think this is not very time-efficient, and I would like to know whether there is a way to do this more efficiently and faster. Thank you, everybody!

asked Feb 11 '23 by Irene Díaz

1 Answer

After fixing some minor syntax errors (already corrected in the listing above), your code seems to work.

However, I would use a multiprocessing pool instead of your custom solution, so that n_cores processes are always running at a time. The problem with your approach is that all processes in a batch need to finish before you start the next batch. Depending on how much the runtime of folding varies, this can become a bottleneck; in the worst case, a single slow task stalls the whole batch and you get no speed-up whatsoever compared to single-core processing, as the toy timing sketch below illustrates.
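
To make the bottleneck concrete, here is a toy sketch (the durations are made up and simulated with time.sleep, not taken from your program): one slow task in a batch keeps the other cores idle, while a pool would hand them new work immediately.

import multiprocessing as mp
import time

def task(seconds):
    time.sleep(seconds)

if __name__ == '__main__':
    durations = [1, 1, 5, 1, 1, 1]  # one slow outlier in the first batch
    start = time.time()
    for i in range(0, len(durations), 3):
        # Batch of 3, as in your original loop
        processes = [mp.Process(target=task, args=(d,)) for d in durations[i:i + 3]]
        for p in processes:
            p.start()
        for p in processes:
            p.join()  # the whole batch waits for the 5 s task
    print('batched: %.1f s' % (time.time() - start))  # ~6 s; a 3-worker pool finishes in ~5 s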

Moreover, your program will run into serious problems under Windows. You need to make sure that your main module can be safely imported without re-running your multiprocessing code. That is, you need to protect your main entry point via if __name__ == '__main__', which you might have seen already in other Python scripts. This makes sure that the protected code is executed only when your script is started as the main file by the interpreter, i.e. only once, and not by every new child process you spawn.
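
As a minimal sketch of the pattern (a hypothetical stand-alone module): without the guard, the pool-creating lines would be re-executed by every child process that imports the module under the 'spawn' start method used on Windows.

import multiprocessing as mp

def work(x):
    return x * x

if __name__ == '__main__':
    # Reached only when the script is run directly, not when a
    # child process re-imports this module.
    pool = mp.Pool(2)
    print(pool.map(work, [1, 2, 3]))
    pool.close()
    pool.join()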

Here is my slightly changed version of your code using a pool:

import multiprocessing as mp


def folding(return_dict, seq):
    dis = 1
    d = 0
    ddg = 1
    '''This part is irrelevant; the real program sends the seq parameter
    to an external program that returns the dis, d and ddg values.'''
    return_dict[seq] = [dis, d, ddg]


def main():
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
    manager = mp.Manager()
    return_dict = manager.dict()
    n_cores = 3

    # Create a pool running at most n_cores worker processes
    pool = mp.Pool(n_cores)

    # Execute the folding task in parallel
    for seq in seqs:
        pool.apply_async(folding, args=(return_dict, seq))

    # Tell the pool that there are no more tasks to come and join
    pool.close()
    pool.join()

    # Print the results
    for i in return_dict.keys():
        print(i, return_dict[i])


if __name__ == '__main__':
    # Protected main function
    main()

and this will print

atcgtg [1, 0, 1]
atcgatcgatc [1, 0, 1]
agcgatcg [1, 0, 1]
atcggatcg [1, 0, 1]
agctgctagct [1, 0, 1]

Example without Shared Data

EDIT: Also, in your case there is actually no need for a shared data structure. You can simply rely on the pool's map function. map takes an iterable, which is used to call the function folding once with each of its elements. Using map rather than apply_async (as in the first example) has the advantage that the results come back in the order of the inputs; the drawback is that map blocks until all results have been gathered before you can process any of them. (map_async is the non-blocking variant; see the sketch after the second example.)

Here is an example using map. Note that your function folding now returns the result instead of putting it into a shared dictionary:

import multiprocessing as mp


def folding(seq):
    dis = 1
    d = 0
    ddg = 1
    '''This part is irrelevant; the real program sends the seq parameter
    to an external program that returns the dis, d and ddg values.'''
    # Return results instead of using shared data
    return [dis, d, ddg]


def main():
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
    n_cores = 3

    pool = mp.Pool(n_cores)

    # Use map which blocks until all results are ready:
    res = pool.map(folding, iterable=seqs)

    pool.close()
    pool.join()

    # Print the results
    # Using list(zip(..)) to print inputs next to outputs
    print(list(zip(seqs, res)))


if __name__ == '__main__':
    main()

and this one prints

[('atcgtg', [1, 0, 1]), ('agcgatcg', [1, 0, 1]), ('atcgatcgatc', [1, 0, 1]), ('atcggatcg', [1, 0, 1]), ('agctgctagct', [1, 0, 1])]
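
For completeness, here is a minimal sketch of the non-blocking map_async variant mentioned above, reusing a placeholder folding: it returns an AsyncResult right away, and get() blocks until all results, again in input order, are available.

import multiprocessing as mp

def folding(seq):
    # Placeholder, as in the examples above
    return [1, 0, 1]

if __name__ == '__main__':
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc']
    pool = mp.Pool(3)
    async_result = pool.map_async(folding, seqs)
    # The main process is free to do other work here ...
    res = async_result.get()  # blocks until all results are ready
    pool.close()
    pool.join()
    print(list(zip(seqs, res)))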
answered Feb 13 '23 by SmCaterpillar