
Updating a shelve dictionary in Python in parallel

I have a program that takes a very large input file and builds a dict from it. Since the dict will not fit in memory, I decided to use shelve to write it to disk. Now I want to take advantage of the multiple cores in my system (8 of them) to speed up the parsing. The most obvious approach, I thought, was to split the input file into 8 parts and run the code on all 8 parts concurrently. The problem is that I need only one dictionary at the end, not 8. So how do I use shelve to update a single dictionary in parallel?

Amitash asked Dec 08 '22 22:12



2 Answers

I gave a pretty detailed answer on this in "Processing single file from multiple processes in python".

Don't try to figure out how you can have many processes write to a shelve at once. Think about how you can have a single process deliver results to the shelve.

The idea is that a single process produces the input and puts it on a queue. You then have as many workers as you want taking items off that queue and doing the work. When a worker finishes an item, it places the result on a result queue for the sink (the single process that writes to the shelve) to read. The benefit is that you do not have to split up the work manually ahead of time. Just produce the "input" and let whichever worker is ready take it and work on it.

With this pattern, you can scale the number of workers up or down based on the system's capabilities.
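A minimal sketch of that producer / workers / sink layout, using only the standard library. The names `parse_line`, `worker`, `sink` and `build_shelf`, the tab-separated line format, and the queue size of 1000 are all assumptions made for illustration; they are not from the question or from the linked answer.

```python
import multiprocessing as mp
import shelve

STOP = None  # sentinel meaning "no more items are coming"


def parse_line(line):
    """Hypothetical parser: split 'key<TAB>value' into a (key, value) pair."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def worker(tasks, results):
    """Take raw lines off the task queue and put parsed pairs on the result queue."""
    for line in iter(tasks.get, STOP):
        results.put(parse_line(line))
    results.put(STOP)  # tell the sink that this worker has finished


def sink(results, shelf_path, n_workers):
    """The only code that opens the shelf, so there are no concurrent writes."""
    finished = 0
    with shelve.open(shelf_path) as shelf:
        while finished < n_workers:
            item = results.get()
            if item is STOP:
                finished += 1
            else:
                key, value = item
                shelf[key] = value


def build_shelf(lines, shelf_path, n_workers=8):
    """Producer: feed lines to the workers and wait until the sink has stored everything."""
    tasks = mp.Queue(maxsize=1000)    # bounded, so the producer cannot run far ahead
    results = mp.Queue(maxsize=1000)
    sink_proc = mp.Process(target=sink, args=(results, shelf_path, n_workers))
    workers = [mp.Process(target=worker, args=(tasks, results)) for _ in range(n_workers)]
    sink_proc.start()
    for proc in workers:
        proc.start()
    for line in lines:
        tasks.put(line)
    for _ in workers:
        tasks.put(STOP)               # one sentinel per worker
    for proc in workers:
        proc.join()
    sink_proc.join()
```

Calling `build_shelf(open_file, "result.shelf")` from under an `if __name__ == "__main__":` guard would stream the file through the workers. Whether this is actually faster depends on how expensive the parsing is compared with the cost of passing each line between processes.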

jdi answered Dec 31 '22 00:12



shelve doesn't support concurrent access. There are a few options for accomplishing what you want:

  1. Make one shelf per process and then merge at the end.

  2. Have the worker processes send their results back to the master process, e.g. over a multiprocessing.Pipe; the master then stores them in the shelf.

  3. I think you can get bsddb to work with concurrent access in a shelve-like API, but I've never had the need to do so.
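Option 1 could look roughly like the sketch below. The helper names `write_part` and `merge_shelves` and the tab-separated line format are assumptions for illustration. Each worker process would call `write_part` on its own chunk and its own file, and the merge runs once in a single process after all workers have finished.

```python
import shelve


def write_part(lines, part_path):
    """Run inside one worker: parse a chunk into that worker's private shelf."""
    with shelve.open(part_path) as part:
        for line in lines:
            # Hypothetical format: 'key<TAB>value' per line.
            key, _, value = line.rstrip("\n").partition("\t")
            part[key] = value


def merge_shelves(part_paths, merged_path):
    """Run once at the end: copy every key from the part shelves into one shelf."""
    with shelve.open(merged_path) as merged:
        for path in part_paths:
            with shelve.open(path, flag="r") as part:
                for key in part:
                    # If a key appears in several parts, the later part wins here.
                    merged[key] = part[key]
```

If the same key can occur in more than one chunk and the values need to be combined rather than overwritten, the assignment in `merge_shelves` is the place to do that.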

Danica answered Dec 31 '22 00:12
