I'd like to find a user-space tool (preferably in Python - barring that, in anything I could easily modify if it doesn't already do what I need it to) to replace a short script I've been using that does the two things below:
For example, using my current script, I can type at a Python prompt
>>> import hosts
>>> hosts.run_commands(['users']*5)
or from the command line
% hosts.py "users" "users" "users" "users" "users"
to run the command users 5 times (after finding 5 computers on which the command could be run by checking the CPU load and available memory on at least 5 computers from a config file). There should be no job server other than the script I just ran, and no worker daemons or processes on the computers that will run these commands.
I'd additionally like to be able to track the jobs, run jobs again on failure, etc., but these are extra features (very standard in a real job scheduler) that I don't actually need.
I've found good ssh libraries for Python, things like classh and PuSSH, but they don't have the (very simple) load balancing features I'd like. At the other end of the spectrum are Condor and Slurm, as suggested by crispamares before I clarified that I want something lighter. Those would be doing things the proper way, but from reading about them, it sounds like spinning them up in user space only when I need them would range from annoying to impossible. This isn't a dedicated cluster, and I don't have root access on these hosts.
I'm currently planning to use a wrapper around classh with some basic polling of computers whenever I need to know how busy they are if I can't find something else.
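As a rough illustration of that polling idea, here is a minimal sketch that uses only the standard library and the system ssh client instead of classh. It assumes key-based (passwordless) ssh login and Linux hosts that expose /proc/loadavg. The function names, the max_load threshold, and the timeouts are all hypothetical choices, and the available-memory check is left out for brevity.

```python
import subprocess


def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def ssh_argv(host, command):
    """Build the argument list for running one command on one host."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, command]


def poll_load(host):
    """Return the host's 1-minute load, or None if it cannot be polled."""
    try:
        result = subprocess.run(
            ssh_argv(host, "cat /proc/loadavg"),
            capture_output=True, text=True, timeout=15, check=True,
        )
        return parse_loadavg(result.stdout)
    except (subprocess.SubprocessError, OSError, ValueError, IndexError):
        return None


def pick_hosts(loads, n, max_load=1.0):
    """Pick the n least-loaded hosts whose load is below max_load.

    `loads` maps host name to load, with None for unreachable hosts.
    """
    usable = sorted(
        (load, host) for host, load in loads.items()
        if load is not None and load < max_load
    )
    if len(usable) < n:
        raise RuntimeError("only %d of %d required hosts are free" % (len(usable), n))
    return [host for _, host in usable[:n]]


def run_commands(commands, hosts):
    """Run each command on its own lightly loaded host; return the exit codes."""
    loads = {host: poll_load(host) for host in hosts}
    chosen = pick_hosts(loads, len(commands))
    procs = [subprocess.Popen(ssh_argv(h, c)) for h, c in zip(chosen, commands)]
    return [p.wait() for p in procs]
```

With a host list read from the config file, run_commands(['users'] * 5, hosts) would mirror the call shown earlier. Nothing stays running on the remote machines once the ssh sessions exit, so there is no daemon or job server beyond the calling script.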
There is Fabric; I am surprised no one has mentioned it.
Slurm is a powerful job scheduler that can be scripted from Python using PySlurm.
I don't know whether it is harder to deploy than Condor, or whether it fits all your needs, but I mention it just in case.