Parallel doseq for Clojure

I haven't used multithreading in Clojure at all so am unsure where to start.

I have a doseq whose body can run in parallel. What I'd like is for there always to be 3 threads running (leaving 1 core free) that evaluate the body in parallel until the range is exhausted. There's no shared state, nothing complicated - the equivalent of Python's multiprocessing would be just fine.

So something like:

(dopar 3 [i (range 100)]
  ; repeated 100 times in 3 parallel threads...
  )

Where should I start looking? Is there a command for this? A standard package? A good reference?

So far I have found pmap, and could use that (how do I restrict to 3 at a time? looks like it uses 32 at a time - no, source says 2 + number of processors), but it seems like this is a basic primitive that should already exist somewhere.

Clarification: I really would like to control the number of threads. My processes are long-running and use a fair amount of memory, so creating a large number of them and hoping things work out OK isn't a good approach (example, which uses a significant chunk of available memory).

Update: I've started writing a macro that does this, and I need a semaphore (or a mutex, or an atom I can wait on). Do semaphores exist in Clojure? Or should I use a ThreadPoolExecutor? It seems odd to have to pull so much in from Java - I thought parallel programming in Clojure was supposed to be easy... Maybe I am thinking about this completely the wrong way? Hmmm. Agents?

4 Answers

OK, I think what I want is to have an agent for each loop, with the data sent to the agent using send. The agents triggered using send are run from a thread pool, so the number is limited in some way (it doesn't give the fine-grained control of exactly three threads, but it'll have to do for now).

[Dave Ray explains in comments: to control pool size I'd need to write my own]

(defmacro dopar [seq-expr & body]
  (assert (= 2 (count seq-expr)) "single pair of forms in sequence expression")
  (let [[k v] seq-expr]
    ;; one agent per element; await blocks until every agent has run the body
    `(apply await
       (for [k# ~v]
         (let [a# (agent k#)]
           (send a# (fn [~k] ~@body))
           a#)))))

which can be used like:

(deftest test-dump
  (dopar [n (range 7 11)]
    (time (do-dump-single "/tmp/single" "a" n 10000000))))

Yay! Works! I rock! (OK, Clojure rocks a little bit too). Related blog post.
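
On the pool-size point: since Clojure 1.5 there is also send-via, which takes the executor to run the action on, so the agent approach can be pointed at a fixed pool without writing much yourself. A minimal sketch, assuming a plain java.util.concurrent fixed thread pool; square-on-pool and the squaring action are hypothetical stand-ins for the real work, and it waits on the pool itself rather than using await:

```clojure
(import '(java.util.concurrent Executors ExecutorService TimeUnit))

(defn square-on-pool
  "Hypothetical demo: squares each number in its own agent, but runs the
  agent actions on a dedicated pool of n-threads threads via send-via."
  [n-threads nums]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n-threads)
        agents (mapv agent nums)]
    (doseq [a agents]
      ;; send-via is like send, but takes the executor to run the action on.
      (send-via pool a (fn [n] (* n n))))
    ;; No new work is accepted after shutdown, but queued actions still run.
    (.shutdown pool)
    (.awaitTermination pool 1 TimeUnit/MINUTES)
    (mapv deref agents)))

(square-on-pool 3 (range 7 11))
;; => [49 64 81 100]
```

The one-minute timeout is an arbitrary choice for the sketch; long-running jobs would need a larger value.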

pmap will actually work fine in most circumstances - it uses a thread pool with a sensible number of threads for your machine. I wouldn't bother trying to create your own mechanisms to control the number of threads unless you have real benchmark evidence that the defaults are causing a problem.

Having said that, if you really want to limit to a maximum of three threads, an easy approach is to just use pmap on 3 subsets of the range:

(defn split-equally
  "Split a collection into a vector of (as close as possible) equally sized parts"
  [num coll]
  (loop [num num
         parts []
         coll coll
         c (count coll)]
    (if (<= num 0)
      parts
      ;; t = ceiling of (remaining items / remaining parts)
      (let [t (quot (+ c num -1) num)]
        (recur (dec num) (conj parts (take t coll)) (drop t coll) (- c t))))))

(defmacro dopar [thread-count [sym coll] & body]
  `(doall (pmap
            (fn [vals#]
              (doseq [~sym vals#]
                ~@body))
            (split-equally ~thread-count ~coll))))

Note the use of doall, which is needed to force evaluation of the pmap (which is lazy).

There's actually a library now for doing exactly this. From its GitHub page:

The claypoole library provides threadpool-based parallel versions of Clojure functions such as pmap, future, and for.

It provides both ordered and unordered versions of these functions.

Why don't you just use pmap? You still can't control the threadpool, but it's a lot less work than writing a custom macro that uses agents (why not futures?).
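
If exact control over the thread count really is needed, the ThreadPoolExecutor route the question mentions is not much code. A minimal sketch, assuming a fixed pool from java.util.concurrent.Executors; run-parallel and dopar-fixed are hypothetical names, not part of any library:

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

(defn run-parallel
  "Hypothetical helper: calls (f x) for every x in coll on a fixed pool of
  n threads, blocks until all calls finish, and returns results in order."
  [n f coll]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n)]
    (try
      ;; Clojure functions implement Callable, so they can be submitted as-is.
      (let [tasks (mapv (fn [x] (fn [] (f x))) coll)]
        ;; invokeAll blocks until every task is done; .get rethrows failures.
        (mapv (fn [^Future fut] (.get fut)) (.invokeAll pool tasks)))
      (finally
        (.shutdown pool)))))

(defmacro dopar-fixed
  "Hypothetical doseq-like wrapper: runs body once per element of coll on
  at most n threads, for side effects only, and returns nil."
  [n [sym coll] & body]
  `(do (run-parallel ~n (fn [~sym] ~@body) ~coll)
       nil))

(dopar-fixed 3 [i (range 10)]
  (println "processing" i))
```

An exception thrown in the body surfaces as a java.util.concurrent.ExecutionException from .get. The pool is created and shut down on every call, which is fine for a handful of long-running jobs but wasteful for many small ones.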

