I run into the following problem when writing scientific code with Python:
Being an engineer, I know all about the right architecture for this (work items queued, worker threads or processes, and results queued and written to a persistent store), but I don't want to implement this myself. The most problematic issue is the need for reruns due to code changes or temporary system issues (e.g. out-of-memory).
I would like to find a framework to which I provide the desired inputs (e.g. a file with one line per run) and then simply start multiple instances of some framework-provided agent that runs my code. If something goes wrong with a run (e.g. a temporary system issue or an exception thrown due to a bug), I want to be able to delete its results and start some more agents. If I take up too many resources, I want to be able to kill some agents without fear of data inconsistency, and have the other agents pick up the remaining work items when they get to them.
Any existing solution? Does anyone have code that does just that and is willing to share it? Thanks!
I might be wrong, but simply using GNU command-line utilities like parallel, or even xargs, seems appropriate to me for this case. Usage might look like this:
cat inputs | parallel ./job.py > results 2> exceptions
This will execute job.py for every line of inputs in parallel (each line is passed to the script as an argument), write successful results to results, and error output from failed runs to exceptions. A lot of usage examples (also for scientific Python scripts) can be found in this Biostars thread.
And, for completeness, Parallel documentation.
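For reference, job.py above could be any script that takes one work item as its command-line argument, prints its result to stdout and lets exceptions go to stderr. A minimal hypothetical sketch (the process function and the output format are assumptions, not part of the original answer):

#!/usr/bin/env python
# Hypothetical job.py: handles exactly one work item per invocation.
# GNU parallel appends each input line as the last command-line argument.
import sys

def process(item):
    # Placeholder for the actual scientific computation on one work item.
    return len(item)

if __name__ == "__main__":
    item = sys.argv[1]
    result = process(item)
    # Successful results go to stdout (collected in `results`); an uncaught
    # exception's traceback goes to stderr (collected in `exceptions`).
    print("%s\t%s" % (item, result))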
First of all, I would like to stress that the problem Uri described in his question is indeed faced by many people doing scientific computing. It may not be easy to see if you work with an established code base that has a well-defined scope - things there do not change as fast as they do in scientific computing or data analysis. This page has an excellent description of why one would like to have a simple solution for parallelizing pieces of code.
So, this project is a very interesting attempt to solve the problem. I have not tried using it myself yet, but it looks very promising!
If by "have access to a ~100 CPUs machines" you mean that you have access to 100 machines, each having multiple CPUs, and if you want a system that is generic enough for different kinds of applications, then the best possible (and in my opinion only) solution is a management layer between your resources and your job input. This is not Python-specific at all; it applies in a much more general sense. Such a layer manages the computing resources, assigns tasks to individual resource units, and monitors the entire system for you. You make use of the resources via a well-defined interface provided by the management system. Such a management system is usually called a "batch system" or "job queueing system". Popular examples are SLURM, Sun Grid Engine, Torque, .... Setting any of them up is not trivial at all, but then your requirements are not trivial either.
Python-based "parallel" solutions usually stop at the single-machine level via multiprocessing
. Performing parallelism beyond a single machine in an automatic fashion requires a well-configured resource cluster. It usually involves higher level mechanisms such as the message passing interface (MPI), which relies on a properly configured resource system. The corresponding configuration is done on the operating system and even hardware level on every single machine involved in the resource pool. Eventually, a parallel computing environment involving many single machines of homogeneous or heterogeneous nature requires setting up such a "batch system" in order to be used in a general fashion.
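For illustration, the single-machine level typically looks like the sketch below, using the standard library's multiprocessing.Pool (the file names and the process_line function are assumptions for the example, not part of this answer):

# Single-machine parallelism with the standard library: worker processes
# on the local CPUs only; no persistent queue, no multi-machine support.
from multiprocessing import Pool

def process_line(line):
    # Placeholder for the actual computation on one work item.
    return len(line)

if __name__ == "__main__":
    with open("inputs") as f:
        work_items = [line.strip() for line in f]
    with Pool(processes=8) as pool:  # at most 8 local worker processes
        results = pool.map(process_line, work_items)
    with open("results", "w") as out:
        for item, result in zip(work_items, results):
            out.write("%s\t%s\n" % (item, result))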
You will not get around the effort of properly implementing such a resource pool. But what you can do is isolate this effort entirely from your application layer: you implement such a managed resource pool once, in a generic fashion, ready to be used by any application through a common interface. This interface is usually implemented at the command-line level, with commands for job submission, monitoring, deletion, and so on. It is up to you to define what a job is and which resources it should consume; it is up to the job queueing system to assign your job to specific machines; and it is up to the (properly configured) operating system and MPI library to make sure that communication between machines works.
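To make that interface concrete: assuming a SLURM-managed cluster, submitting one job per input line could be done with a small driver like the sketch below (the file names and the reuse of job.py from the GNU parallel answer are assumptions for the sketch); monitoring and deletion would then go through squeue and scancel.

# Hypothetical driver: submit one SLURM batch job per work item via sbatch.
# Requires a working SLURM installation and an existing results/ directory.
import subprocess

with open("inputs") as f:
    for i, line in enumerate(f):
        item = line.strip()
        subprocess.check_call([
            "sbatch",
            "--job-name=job-%d" % i,
            "--output=results/%d.out" % i,   # stdout of the job
            "--error=results/%d.err" % i,    # stderr (tracebacks etc.)
            "--wrap=./job.py '%s'" % item,   # command run on the allocated node
        ])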
If you need to use hardware distributed among multiple machines for a single application, and assuming the machines can talk to each other via TCP/IP, there are Python-based solutions that implement less general, so to speak, job queueing systems. You might want to have a look at http://python-rq.org/ or http://itybits.com/pyres/intro.html (there are many other comparable systems out there, all based on an independent messaging/queueing instance such as Redis or ZeroMQ).
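For example, with python-rq (assuming a Redis server is running, and with mytasks.run_simulation standing in for your own, importable job function - these names are illustrative, not from the original post), filling the queue looks roughly like this; workers are then started and killed independently with the rq worker command, and failed jobs can be inspected and re-run later.

# Sketch of a Redis-backed work queue with python-rq.
# Workers are started separately (e.g. with `rq worker`); killing a worker
# does not lose queued work items, other workers simply pick them up.
from redis import Redis
from rq import Queue

from mytasks import run_simulation  # hypothetical module with your job function

q = Queue(connection=Redis())

with open("inputs") as f:
    for line in f:
        q.enqueue(run_simulation, line.strip())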