I have a list of uids and want to check if a uid is a member of this list
The natural way to implement it would be to create a set (clojure.set
) of uids and search for that member on that list
What I found out is that map key lookup is a lot faster - I used the following snippet to benchmark both methods:
(def uids #{:a :b :c :d :e :f :g :h :i :j :k :l :m :n :o :p :a1 :b1 :c1 :d1 :e1 :f1 :h1 :i1 :j1 :k1 :l1 :m1 :n1 :o1 :p1})
(def uids-map (reduce (fn [acc v] (assoc acc v true)) {} uids))
(time (dotimes [i 1000000] (:o1 uids)))
;user=> "Elapsed time: 191.076266 msecs"
(time (dotimes [i 1000000] (:o1 uids-map)))
;user=> "Elapsed time: 38.159388 msecs"
the results were very consistent across invocations - map lookup took about 1/5 of set lookup
So is set not optimal for key lookup or am I using it the wrong way?
Also, what are the reasons for the differences in these benchmarks?
I was under the impression that a set is implemented in clojure as an associative data structure similar to vectors - so why is key lookup significantly slower than a simple map?
Of course, Clojure’s = is more general than Java .equals, which is a stricter comparison in several cases; here is just one example: So we need to be sure that the keys we use in our map are carefully constructed to always compare correctly with the stronger .equals semantics.
Clojure is readily optimizable in cases where utmost performance is key. Clojure embraces the “Get it working, then make it fast” approach to building up a software system—even large-scale systems with high performance demands ¹.
All of the Clojure collections are immutable and persistent. In particular, the Clojure collections support efficient creation of 'modified' versions, by utilizing structural sharing, and make all of their performance bound guarantees for persistent use. The collections are efficient and inherently thread-safe.
Implement the non-optional (read-only) portion of java.util.Collection or java.util.Map nil is a possible value of any data type in Clojure. nil has the same value as Java null.
I've never went into clojure's source but from what I see the set implementation actually uses a map inside:
protected APersistentSet(IPersistentMap impl){
this.impl = impl;
}
It also delegates the invoke
call to the internal map.
In APersistentSet:
public Object invoke(Object arg1) {
return get(arg1);
}
// ....
public Object get(Object key){
return impl.valAt(key);
}
In APersistentMap:
public Object invoke(Object arg1) {
return valAt(arg1);
}
public Object invoke(Object arg1, Object notFound) {
return valAt(arg1, notFound);
}
So this can't explain the difference.
As mentioned in the comments by @cgrand, when we reverse the arguments its faster (and about the same, since we call set's invoke
immediately). So I looked up Keyword's invoke
which is what is probably used for (:k obj)
:
final public Object invoke(Object obj, Object notFound) {
if(obj instanceof ILookup)
return ((ILookup)obj).valAt(this,notFound);
return RT.get(obj, this, notFound);
}
The important thing to notice is that ILookup
is implemented in APersistentMap
(through Associative
) but not in APersistentSet
. You can also verify in clojure:
(instance? clojure.lang.ILookup #{}) ;; false
(instance? clojure.lang.ILookup {}) ;; true
So maps go through the "happy path" and sets end up in RT.get
which I believe is the runtime.
Lets have a look at the runtime.
It Initially attempts to do practically the same thing as keyword:
static public Object get(Object coll, Object key){
if(coll instanceof ILookup)
return ((ILookup) coll).valAt(key);
return getFrom(coll, key);
}
But since we know sets do not implement ILookup
we know they go to RT.getFrom
:
static Object getFrom(Object coll, Object key){
if(coll == null)
return null;
else if(coll instanceof Map) {
Map m = (Map) coll;
return m.get(key);
}
else if(coll instanceof IPersistentSet) {
IPersistentSet set = (IPersistentSet) coll;
return set.get(key);
}
else if(key instanceof Number && (coll instanceof String || coll.getClass().isArray())) {
int n = ((Number) key).intValue();
if(n >= 0 && n < count(coll))
return nth(coll, n);
return null;
}
else if(coll instanceof ITransientSet) {
ITransientSet set = (ITransientSet) coll;
return set.get(key);
}
return null;
}
Which leads me to believe the main difference is the extra delegations and instanceof
calls due to sets not implementing ILookup
.
As bonus I've added a test on a set that implements ILookup
and delegates valAt
to the internal map immediately (using proxy
) which closed the gap a bit:
(def uids #{:a :b :c :d :e :f :g :h :i :j :k :l :m :n :o :p :a1 :b1 :c1 :d1 :e1 :f1 :h1 :i1 :j1 :k1 :l1 :m1 :n1 :o1 :p1})
(def uids-map (into {} (for [k uids] [k k])))
(def lookupable-set (proxy [clojure.lang.APersistentSet clojure.lang.ILookup] [uids-map]
(valAt [k] (get uids-map k))))
;; verify
(instance? clojure.lang.APersistentSet lookupable-set) ;; true
(instance? clojure.lang.ILookup lookupable-set) ;; true
(time (dotimes [i 1000000] (:o1 uids))) ;; 134.703101 msecs
(time (dotimes [i 1000000] (:o1 lookupable-set))) ;; 63.187353 msecs <-- faster
(time (dotimes [i 1000000] (:o1 uids-map))) ;; 35.802762 msecs <-- still fastest
To conclude: Where performance matters - invoking the set (#{...} k)
without going through keyword (k #{...})
is as fast as map.
But I could be wrong :)
The implementation of contains?
uses clojure.lang.RT.contains
which has plenty of instanceof
checks (compared to containsKey
), which is likely the cause of the performance difference.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With