I have a pair of vectors x
and y
of unique items, each of which I know to be sorted. I wish to have the intersection of the two, maintaining sort order. The result ideally would be another vector, for fast random access.
The generation below is merely for the sake of example, my x
and y
will come presorted and pre-distinct (they are in fact time samples).
(defn gen-example [c] (-> (repeatedly c #(-> c rand int)) distinct sort vec))
user=> (def x (gen-example 100000)) (count x)
#'user/x
63161
user=> (def y (gen-example 100000)) (count y)
#'user/y
63224
I know Clojure has clojure.set/intersection
which can work on a sorted-set
. My x
and y
have the same properties (sorted distinct elements) but are not the same type.
Question 1: Is there a better/faster way to convert x
and y
to sorted-set
s than (apply sorted-set x)
given that they are already distinct and sorted?
user=> (time (def ssx (apply sorted-set x)))
"Elapsed time: 607.642592 msecs"
user=> (time (def ssy (apply sorted-set y)))
"Elapsed time: 617.046022 msecs"
Now I am ready to perform my intersection
user=> (time (count (clojure.set/intersection ssx ssy)))
"Elapsed time: 355.42534 msecs"
39992
This is somewhat disappointing performance, and a cursory look at (source clojure.set/intersection)
does not seem to show any special treatment for the fact that these sets are sorted.
Question 2: Is there a better/faster way to perform the intersection of sorted-set
s than clojure.set/intersection
?
(defn intersect-sorted-vector [x y]
(loop [x (seq x) y (seq y) acc []]
(if (and x y)
(let [x1 (first x)
y1 (first y)]
(cond
( < x1 y1) (recur (next x) y acc)
( > x1 y1) (recur x (next y) acc)
:else (recur (next x) (next y) (conj acc x1))))
acc)))
This turns out to be a good deal (nearly 10x) faster.
user=> (time (count (intersect-sorted-vector x y)))
"Elapsed time: 40.142532 msecs"
39992
But, I can't help but feel that my code is unduly procedural/iterative.
Question 3: Could anyone kindly suggest a more idiomatic way to process a pair of vectors in Clojure?
It is often the case that fast Clojure code looks a bit imperative. Functional code is often elegant, but comes with some associated performance costs that you have to pay for (laziness, extra GC pressure from discarded immutable objects etc.)
Also, converting into sets is always going to be more expensive. Building a set is an O(n log n)
operation in itself, but you can exploit the fact that the vectors are already supported to implement an intersection operation in O(n)
time.
Your code is already pretty good, but there are still a couple more optimisations you can do:
Resulting code might look something like:
(defn intersect-sorted-vector [x y]
(loop [i (long 0), j (long 0), r (transient [])]
(let [xi (nth x i nil), yj (nth y j nil)]
(cond
(not (and xi yj)) (persistent! r)
(< xi yj) (recur (inc i) j r)
(> xi yj) (recur i (inc j) r)
:else (recur (inc i) (inc j) (conj! r xi))))))
(time (count (intersect-sorted-vector x y)))
=> "Elapsed time: 5.143687 msecs"
=> 40258
So as you can see, this probably gives you an extra 6-8x speedup or thereabouts.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With