Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clojure Set vs Map Lookup Performance Differences

Tags:

clojure

I have a list of uids and want to check if a uid is a member of this list

The natural way to implement it would be to create a set (clojure.set) of uids and search for that member on that list

What I found out is that map key lookup is a lot faster - I used the following snippet to benchmark both methods:

(def uids #{:a :b :c :d :e :f :g :h :i :j :k :l :m :n :o :p :a1 :b1 :c1 :d1 :e1 :f1 :h1 :i1 :j1 :k1 :l1 :m1 :n1 :o1 :p1})
(def uids-map (reduce (fn [acc v] (assoc acc v true)) {} uids))
(time (dotimes [i 1000000] (:o1 uids)))
;user=> "Elapsed time: 191.076266 msecs"
(time (dotimes [i 1000000] (:o1 uids-map)))
;user=> "Elapsed time: 38.159388 msecs"

the results were very consistent across invocations - map lookup took about 1/5 of set lookup

So is set not optimal for key lookup or am I using it the wrong way?

Also, what are the reasons for the differences in these benchmarks?

I was under the impression that a set is implemented in clojure as an associative data structure similar to vectors - so why is key lookup significantly slower than a simple map?

like image 822
Erez Rabih Avatar asked Nov 18 '19 09:11

Erez Rabih


People also ask

What is the difference between Java equals and = in Clojure?

Of course, Clojure’s = is more general than Java .equals, which is a stricter comparison in several cases; here is just one example: So we need to be sure that the keys we use in our map are carefully constructed to always compare correctly with the stronger .equals semantics.

Why use Cl Clojure?

Clojure is readily optimizable in cases where utmost performance is key. Clojure embraces the “Get it working, then make it fast” approach to building up a software system—even large-scale systems with high performance demands ¹.

Are Clojure collections immutable and persistent?

All of the Clojure collections are immutable and persistent. In particular, the Clojure collections support efficient creation of 'modified' versions, by utilizing structural sharing, and make all of their performance bound guarantees for persistent use. The collections are efficient and inherently thread-safe.

What is the value of any data type in Clojure?

Implement the non-optional (read-only) portion of java.util.Collection or java.util.Map nil is a possible value of any data type in Clojure. nil has the same value as Java null.


2 Answers

I've never went into clojure's source but from what I see the set implementation actually uses a map inside:

protected APersistentSet(IPersistentMap impl){
    this.impl = impl;
}

It also delegates the invoke call to the internal map.

In APersistentSet:

public Object invoke(Object arg1) {
    return get(arg1);
}

// ....

public Object get(Object key){
    return impl.valAt(key);
}

In APersistentMap:

public Object invoke(Object arg1) {
    return valAt(arg1);
}

public Object invoke(Object arg1, Object notFound) {
    return valAt(arg1, notFound);
}

So this can't explain the difference.

As mentioned in the comments by @cgrand, when we reverse the arguments its faster (and about the same, since we call set's invoke immediately). So I looked up Keyword's invoke which is what is probably used for (:k obj):

final public Object invoke(Object obj, Object notFound) {
    if(obj instanceof ILookup)
        return ((ILookup)obj).valAt(this,notFound);
    return RT.get(obj, this, notFound);
}

The important thing to notice is that ILookup is implemented in APersistentMap (through Associative) but not in APersistentSet. You can also verify in clojure:

(instance? clojure.lang.ILookup #{}) ;; false
(instance? clojure.lang.ILookup {})  ;; true

So maps go through the "happy path" and sets end up in RT.get which I believe is the runtime.

Lets have a look at the runtime.

It Initially attempts to do practically the same thing as keyword:

static public Object get(Object coll, Object key){
    if(coll instanceof ILookup)
        return ((ILookup) coll).valAt(key);
    return getFrom(coll, key);
}

But since we know sets do not implement ILookup we know they go to RT.getFrom:

static Object getFrom(Object coll, Object key){
    if(coll == null)
        return null;
    else if(coll instanceof Map) {
        Map m = (Map) coll;
        return m.get(key);
    }
    else if(coll instanceof IPersistentSet) {
        IPersistentSet set = (IPersistentSet) coll;
        return set.get(key);
    }
    else if(key instanceof Number && (coll instanceof String || coll.getClass().isArray())) {
        int n = ((Number) key).intValue();
        if(n >= 0 && n < count(coll))
            return nth(coll, n);
        return null;
    }
    else if(coll instanceof ITransientSet) {
        ITransientSet set = (ITransientSet) coll;
        return set.get(key);
    }

    return null;
}

Which leads me to believe the main difference is the extra delegations and instanceof calls due to sets not implementing ILookup.

As bonus I've added a test on a set that implements ILookup and delegates valAt to the internal map immediately (using proxy) which closed the gap a bit:

(def uids #{:a :b :c :d :e :f :g :h :i :j :k :l :m :n :o :p :a1 :b1 :c1 :d1 :e1 :f1 :h1 :i1 :j1 :k1 :l1 :m1 :n1 :o1 :p1})
(def uids-map (into {} (for [k uids] [k k])))
(def lookupable-set (proxy [clojure.lang.APersistentSet clojure.lang.ILookup] [uids-map]
                      (valAt [k] (get uids-map k))))

;; verify
(instance? clojure.lang.APersistentSet lookupable-set) ;; true
(instance? clojure.lang.ILookup lookupable-set) ;; true

(time (dotimes [i 1000000] (:o1 uids))) ;; 134.703101 msecs
(time (dotimes [i 1000000] (:o1 lookupable-set))) ;; 63.187353 msecs  <-- faster
(time (dotimes [i 1000000] (:o1 uids-map))) ;; 35.802762 msecs <-- still fastest

To conclude: Where performance matters - invoking the set (#{...} k) without going through keyword (k #{...}) is as fast as map.

But I could be wrong :)

like image 111
Reut Sharabani Avatar answered Oct 15 '22 15:10

Reut Sharabani


The implementation of contains? uses clojure.lang.RT.contains which has plenty of instanceof checks (compared to containsKey), which is likely the cause of the performance difference.

like image 23
Jochen Bedersdorfer Avatar answered Oct 15 '22 16:10

Jochen Bedersdorfer