I just finished reading "Programming Concurrency on the JVM" by Venkat Subramaniam. One of the author's examples in that book is counting the file sizes in a directory tree. He shows implementations using no concurrency, queues, a latch, and Scala actors. On my system, all the concurrent implementations (queues, latch, and Scala actors) run in under 9 seconds when iterating through my /usr directory (OS X 10.6.8, Core Duo 2 GHz, Intel G1 SSD 160 GB).
I'm learning Clojure and decided I'd port the Scala actor version to Clojure using agents. Unfortunately, I was averaging 11-12 seconds, which is significantly slower than the others. After spending days pulling my hair out, I discovered that the following bit of code was the culprit (processFile is a function I send to the file-processing agent(s)):
(defn processFile
  [fileProcessor collectorAgent ^String fileName]
  (let [^File file-obj (File. ^String fileName)
        fileTotals (transient {:files 0, :bytes 0})]
    (cond
      (.isDirectory file-obj)
      (do
        (doseq [^File dir (.listFiles file-obj) :when (.isDirectory dir)]
          (send collectorAgent addFileToProcess (.getPath dir)))
        (send collectorAgent tallyResult *agent*)
        (reduce (fn [currentTotal newItem]
                  (assoc! currentTotal
                          :files (inc (:files currentTotal))
                          :bytes (+ (:bytes currentTotal) newItem)))
                fileTotals
                (map #(.length ^File %) (filter #(.isFile ^File %) (.listFiles file-obj))))
        (persistent! fileTotals))

      (.isFile file-obj)
      (do (send collectorAgent tallyResult *agent*)
          {:files 1, :bytes (.length file-obj)}))))
You'll notice I tried using type hints and a transient to improve performance, all to no avail. I replaced the above code with the following:
(defn processChildren
  [children]
  (loop [entries children
         files 0
         bytes 0
         dirs '()]
    (let [^File child (first entries)]
      (cond
        (not (seq entries)) {:files files, :bytes bytes, :dirs dirs}
        (.isFile child) (recur (rest entries) (inc files) (+ bytes (.length child)) dirs)
        (.isDirectory child) (recur (rest entries) files bytes (conj dirs child))
        :else (recur (rest entries) files bytes dirs)))))
(defn processFile
  [fileProcessor collectorAgent ^String fileName]
  (let [{files :files, bytes :bytes, dirs :dirs} (processChildren (.listFiles (File. fileName)))]
    (doseq [^File dir dirs]
      (send collectorAgent addFileToProcess (.getPath dir)))
    (send collectorAgent tallyResult *agent*)
    {:files files, :bytes bytes}))
This version performed on par with, if not faster than, the Scala version, and it is almost identical to the algorithm used in the Scala version. I had simply assumed that the functional approach to the algorithm would work just as well.
So... this long-winded question boils down to the following: why is the second version faster?
My hypothesis is that although the first version's map/filter/reduce over the directory contents is more "functional" than the second version's rather imperative processing, it is much less efficient because the directory's contents end up being iterated multiple times. Since filesystem I/O is slow, the entire program suffers.
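To make the hypothesis concrete: the first version calls .listFiles twice and then walks the result once for filter, once for map, and once for reduce, with .isFile and .length each hitting the filesystem. For comparison, a single-pass version that is still functional might look like this sketch (tally-dir is an illustrative name, not part of my program):

```clojure
;; Sketch: one .listFiles call, one pass over the result, still functional.
(defn tally-dir [^java.io.File dir]
  (reduce (fn [totals ^java.io.File f]
            (if (.isFile f)
              (-> totals
                  (update-in [:files] inc)
                  (update-in [:bytes] + (.length f)))   ; one stat per file
              totals))
          {:files 0, :bytes 0}
          (.listFiles dir)))
```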
Assuming I am right, is it not then safe to say that any recursive filesystem algorithm should prefer an imperative approach for performance?
I'm a total beginner at Clojure so feel free to rip my code to shreds if I'm doing something stupid or non-idiomatic.
I've edited the first version to make it more readable. I have a few comments, but no conclusively helpful statements:
You added transients and type hints with no real evidence as to what was slowing things down. It's entirely possible to slow things down dramatically with careless application of these operations, so it's a good idea to profile and find out what's actually slow. Your choices seem reasonable, but I've removed the type hints that were obviously having no effect (e.g., the compiler needs no hint to know that (File. ...) yields a File object).
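As a cheap first diagnostic, Clojure can report exactly where reflection (the usual reason a type hint matters) is happening. A minimal sketch, with illustrative function names:

```clojure
;; Warn whenever the compiler must fall back to runtime reflection.
(set! *warn-on-reflection* true)

(defn file-length [f]
  (.length f))                    ; warns: .length can't be resolved statically

(defn file-length-hinted [^java.io.File f]
  (.length f))                    ; no warning: the hint removes the reflective call
```

Hints that don't silence a warning (or were never needed, as with the (File. ...) constructor) are doing nothing for you.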
Clojure (indeed, any Lisp) strongly prefers some-agent to someAgent. The prefix syntax means there's no worry that - can be parsed as a subtraction by a clueless compiler, so we can afford more well-spaced names.
You include calls to a bunch of functions that you don't define here at all, like tallyResult and addFileToProcess. Presumably they perform fine, since you use them in the performant version too, but by not including them you've made it difficult for anyone else to poke around and see what speeds things up.
Consider send-off instead of send for I/O-bound operations: send uses a bounded thread pool to avoid swamping your processor. Here it probably doesn't matter, since you're only using one agent and it serializes its actions, but in future you'll run into cases where it does.
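For illustration, the same action dispatched either way (the agent and helper names here are hypothetical, not from your program):

```clojure
;; An agent accumulating a tally, as in your collector.
(def tally-agent (agent {:files 0, :bytes 0}))

(defn add-file [state ^java.io.File f]
  ;; .length touches the filesystem, so this action is I/O-bound.
  (-> state
      (update-in [:files] inc)
      (update-in [:bytes] + (.length f))))

;; CPU-bound work: fixed pool, won't oversubscribe your cores.
;;   (send tally-agent add-file f)
;; I/O-bound work: expandable pool, a blocked action won't starve others.
;;   (send-off tally-agent add-file f)
```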
Anyway, as promised, a more-legible rewrite of your first version:
(defn process-file
  [_ collector-agent ^String file-name]
  (let [file-obj (File. file-name)
        file-totals (transient {:files 0, :bytes 0})]
    (cond (.isDirectory file-obj)
          (do
            (doseq [^File dir (.listFiles file-obj)
                    :when (.isDirectory dir)]
              (send collector-agent addFileToProcess
                    (.getPath dir)))
            (send collector-agent tallyResult *agent*)
            (reduce (fn [current-total new-item]
                      (assoc! current-total
                              :files (inc (:files current-total))
                              :bytes (+ (:bytes current-total) new-item)))
                    file-totals
                    (map #(.length ^File %)
                         (filter #(.isFile ^File %)
                                 (.listFiles file-obj))))
            (persistent! file-totals))

          (.isFile file-obj)
          (do (send collector-agent tallyResult *agent*)
              {:files 1, :bytes (.length file-obj)}))))
Edit: You're using transients in an incorrect way, by throwing away the result of your reduce. (assoc! m k v) is allowed to modify and return the m object, but may return a different object if that's more convenient or efficient. So you need something more like (persistent! (reduce ...)).