
Distributed computing framework for Clojure/Java

I'm developing an application where I need to distribute a set of tasks across a potentially quite large cluster of different machines.

Ideally I'd like a very simple, idiomatic way to do this in Clojure, e.g. something like:

; create a clustered set of machines
(def my-cluster (new-cluster list-of-ip-addresses))

; define a task to be executed
(deftask my-task (my-function arg1 arg2))

; run a task 10000 times on the cluster
(def my-job (run-task my-cluster my-task {:repeat 10000}))

; do something with the results:
(some-function (get-results my-job))

Bonus points if it can do something like MapReduce on the cluster as well.

What's the best way to achieve something like this? Maybe I could wrap an appropriate Java library?

UPDATE:

Thanks for all the suggestions of Apache Hadoop - it looks like it might fit the bill; however, it seems a bit like overkill, since I don't need a distributed data storage system like Hadoop uses (i.e. I don't need to process billions of records). Something more lightweight and focused on compute tasks only would be preferable, if it exists.
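In the meantime, the API shape sketched in the question can be approximated on a single machine with nothing but clojure.core; `run-task` below is a hypothetical stand-in (parallel threads via `pmap`, not real cross-machine distribution):

```clojure
;; Hypothetical single-machine sketch of the desired API shape.
;; run-task is an invented name; pmap gives thread-level parallelism
;; only -- a real cluster version would need a library underneath.
(defn run-task
  "Run (task-fn) n times in parallel, returning a realized seq of results."
  [task-fn {:keys [repeat] :or {repeat 1}}]
  (doall (pmap (fn [_] (task-fn)) (range repeat))))

;; usage: run a task 10 times (here: on local threads, not a cluster)
(def results (run-task (fn [] (+ 1 2)) {:repeat 10}))
```

A distributed framework would essentially replace `pmap` here with a scheduler that ships the function (or its name) to remote workers and gathers the results.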

asked Feb 26 '11 16:02 by mikera

4 Answers

Hadoop is the base for almost all the large-scale big-data excitement in the Clojure world these days, though there are better ways than using Hadoop directly.

Cascalog is a very popular front end:

    Cascalog is a tool for processing data on Hadoop with Clojure in a concise and
    expressive manner. Cascalog combines two cutting edge technologies in Clojure 
    and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, 
    flexible, and robust.

Also check out Amit Rathor's swarmiji, a distributed worker framework built on top of RabbitMQ. It's less focused on data processing and more on distributing a fixed number of tasks to a pool of available computing power. (P.S. It's covered in his book, Clojure in Action.)

answered Nov 15 '22 12:11 by Arthur Ulfeldt

Although I haven't gotten to use it yet, I think that Storm is something that you might find useful to explore:

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!

answered Nov 15 '22 12:11 by David J.

Hadoop is exactly what you need: Apache Hadoop

answered Nov 15 '22 13:11 by Thomas Jungblut

Storm might suit your needs better than Hadoop, as it has no distributed data storage and offers low latency. It's possible to split up and process data in a manner similar to MapReduce, and the Trident API makes this very simple.

It is partly written in Clojure, so I suppose Clojure interop is easier.

Another option is Onyx, which offers similar functionality but is a pure-Clojure project.
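To give a feel for Onyx's style: a job is described as plain data, with a workflow (a DAG of task edges) and a catalog (a map per task). A rough sketch follows, based on Onyx's documented data-driven format; treat the exact keys and required entries as approximate, and `:my.app/my-function` as a hypothetical fully-qualified function:

```clojure
;; Sketch of an Onyx job as pure data (details approximate).
;; The workflow lists edges of the task graph:
(def workflow
  [[:read-input :process]      ; :read-input feeds :process
   [:process    :write-output]]) ; :process feeds :write-output

;; The catalog describes each task; :process calls an ordinary
;; Clojure function named by a fully-qualified keyword:
(def catalog
  [{:onyx/name       :process
    :onyx/fn         :my.app/my-function ; hypothetical var
    :onyx/type       :function
    :onyx/batch-size 100}
   ;; ...plus input/output plugin entries for :read-input
   ;; and :write-output
   ])
```

Because the job is just data, it can be built, inspected, and tested like any other Clojure value before being submitted to a cluster.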

answered Nov 15 '22 14:11 by ChrisBlom