I'm developing an application where I need to distribute a set of tasks across a potentially quite large cluster of different machines.
Ideally I'd like a very simple, idiomatic way to do this in Clojure, e.g. something like:
; create a clustered set of machines
(def my-cluster (new-cluster list-of-ip-addresses))
; define a task to be executed
(deftask my-task (my-function arg1 arg2))
; run a task 10000 times on the cluster
(def my-job (run-task my-cluster my-task {:repeat 10000}))
; do something with the results:
(some-function (get-results my-job))
Bonus if it can do something like Map-Reduce on the cluster as well.....
What's the best way to achieve something like this? Maybe I could wrap an appropriate Java library?
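For a sense of what the desired API implies, here is a single-JVM sketch of it using plain Clojure futures as stand-ins for remote machines. Every name (`new-cluster`, `run-task`, `get-results`) is hypothetical, taken from the pseudocode above rather than any real library:

```clojure
;; Single-JVM sketch of the desired API. Futures on a local thread
;; pool stand in for remote machines; all names are hypothetical.
(defn new-cluster [ips]
  ;; In a real system each entry would wrap a connection to a machine;
  ;; here a "node" is just its address.
  {:nodes (vec ips)})

(defn run-task [cluster task {:keys [repeat] :or {repeat 1}}]
  ;; Fan the task out; a real implementation would ship it to the
  ;; cluster's machines instead of local threads.
  {:futures (doall (for [_ (range repeat)] (future (task))))})

(defn get-results [job]
  (mapv deref (:futures job)))

;; usage
(def my-cluster (new-cluster ["10.0.0.1" "10.0.0.2"]))
(def my-job (run-task my-cluster #(+ 1 2) {:repeat 4}))
(get-results my-job) ; => [3 3 3 3]
```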
UPDATE:
Thanks for all the suggestions of Apache Hadoop - it looks like it might fit the bill, but it seems a bit like overkill, since I don't need a distributed data storage system like the one Hadoop uses (i.e. I don't need to process billions of records)... something more lightweight and focused on compute tasks only would be preferable, if it exists.
Hadoop is the base for almost all the large-scale big-data excitement in the Clojure world these days, though there are better ways than using Hadoop directly.
Cascalog is a very popular front end:
Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.
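To give a flavour of the style, here is a sketch of the word-count query along the lines of the Cascalog project README. `sentences` is an assumed source tap of one-field text tuples, and details may vary by version:

```clojure
;; Hedged sketch of a Cascalog word-count query, after the project
;; README. `sentences` is an assumed source tap of sentence strings.
(use 'cascalog.api)
(require '[cascalog.ops :as c])

(defmapcatop split-words
  "Emit one tuple per whitespace-separated word in a sentence."
  [^String sentence]
  (seq (.split sentence "\\s+")))

;; Declarative, Datalog-style query: count occurrences of each word.
(?<- (stdout)
     [?word ?count]
     (sentences ?sentence)
     (split-words ?sentence :> ?word)
     (c/count ?count))
```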
Also check out Amit Rathor's swarmiji, a distributed worker framework built on top of RabbitMQ. It's less focused on data processing and more on distributing a fixed number of tasks to a pool of available computing power. (P.S. It's covered in his book, Clojure in Action.)
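From memory of the swarmiji README, the API centers on defining workers and waiting on their results. Treat this as an unverified sketch; the exact names (`defworker`, `from-swarm`, the `:value` accessor) may differ by version:

```clojure
;; Unverified sketch of swarmiji usage, from memory of its README.
;; Requires a running RabbitMQ broker and swarmiji on the classpath.

;; Define a worker: the body runs on whichever remote node picks it up.
(defworker add-numbers
  [a b]
  (+ a b))

;; Calling the worker dispatches the task over RabbitMQ and
;; immediately returns a proxy for the eventual result.
(def result (add-numbers 1 2))

;; from-swarm blocks until the listed workers have completed.
(from-swarm [result]
  (println "sum:" (result :value)))
```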
Although I haven't gotten to use it yet, I think that Storm is something that you might find useful to explore:
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Hadoop is exactly what you need: Apache Hadoop
Storm might suit your needs better than Hadoop, as it has no distributed data storage and offers low latency. It can split up and process data in a MapReduce-like fashion, and the Trident API makes this very simple.
It is partly written in Clojure, so I suppose Clojure interop is easier.
Another option is Onyx which offers similar functionality, but is a pure Clojure based project.
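One distinctive point about Onyx is that a job is described as plain Clojure data. A hedged sketch follows; the key names track Onyx's documented shape, but they are not checked against a specific version, and the core.async input/output entries are assumptions:

```clojure
;; Hedged sketch of an Onyx job definition: the whole job is plain data.
(defn inc-n
  "Pure function applied to each segment flowing through the job."
  [segment]
  (update segment :n inc))

(def workflow
  ;; Segments flow :in -> :inc-n -> :out
  [[:in :inc-n]
   [:inc-n :out]])

(def catalog
  [{:onyx/name :in
    :onyx/type :input
    :onyx/medium :core.async
    :onyx/batch-size 10}
   {:onyx/name :inc-n
    :onyx/fn ::inc-n
    :onyx/type :function
    :onyx/batch-size 10}
   {:onyx/name :out
    :onyx/type :output
    :onyx/medium :core.async
    :onyx/batch-size 10}])
```

Because the workflow and catalog are just data structures, they can be generated, inspected, and tested like any other Clojure values before being submitted to a cluster.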