Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clojure: Split a sequence into top-n and rest?

Tags:

clojure

I'd like split a sequence into it's top-n elements and the rest. Here's an inefficient implementation using the built in sort and split-at:

> (defn split-top-n
    [n comp coll]
    (split-at n (sort comp coll)))

> (split-top-n 2 #(- %2 %1) (list 6.2 5.1 88.0 90.1 1.2 16.9))
[(90.1 88.0) (16.9 6.2 5.1 1.2)]

Is there an efficient Clojure built-in for this? Or do I need to write my own?

like image 643
bjun Avatar asked Mar 23 '23 11:03

bjun


2 Answers

There is no such function in the standard library. The simple implementation that you've already written is actually only inefficient for the special case of small values of n, but perfectly fine in the general case.

As long as you don't know that this function in its current implementation is really a significant performance bottleneck in your complete application, it's probably a waste of effort to write a more complicate version.

EDIT: Thinking somewhat more about this, it might be worth a try to write an implementation that coerces the sequence into a vector and then does an in-place Quickselect to partition the n best elements to the beginning of the vector. That should be relatively easy to do and could provide a reasonable better performance as long as your pivot elements are well chosen.

EDIT 2: I decided to try that implementation myself. It worked fine with some simple test cases, but I'm not completely sure that there are no off-by-one bugs in this that might trigger in some edge cases:

(defn split-top-n
  [n comp coll]
  (let [v (transient (vec coll))]
    (loop [start 0, end (count v)]
      (when (> end n)
        (let [pos (loop [i (inc start), pos start]
                    (if (< i end)
                      (if (comp (v i) (v start))
                        (let [pos* (inc pos)]
                          (assoc! v, i (v pos*), pos* (v i))
                          (recur (inc i) pos*))
                        (recur (inc i) pos))
                      (do
                        (assoc! v, start (v pos), pos (v start))
                        pos)))]
          (if (< pos n)
            (recur (inc pos) end)
            (recur start pos)))))
    (split-at n (persistent! v))))

Clarification: this expects a simple boolean comparator function for comp instead of one of the negative/zero/positive number kind.

EDIT 3: I took another look at the documentation for transients and noticed that I was exploiting undefined behaviour. It may actually be the case that the above version will actually always work as expected, but a correct version should nonetheless respect the language documentation. I will leave the previous version in this answer in place because the answer was already accepted with it, but here's a version that's using the return value of assoc! as the documentation demands:

(defn swap-in!
  [v i j]
  (assoc! v, i (v j), j (v i)))

(defn quickpartition!
  [comp v start end]
  (loop [v v, i (inc start), pos start]
    (if (< i end)
      (if (comp (v i) (v start))
        (recur (swap-in! v i (inc pos)) (inc i) (inc pos))
        (recur v (inc i) pos))
      [(swap-in! v start pos) pos])))

(defn split-top-n
  [n comp coll]
  (loop [v (transient (vec coll)), start 0, end (count v)]
    (if (> end n)
      (let [[v* pos] (quickpartition! comp v start end)]
        (if (< pos n)
          (recur v* (inc pos) end)
          (recur v* start pos)))
      (split-at n (persistent! v)))))

EDIT 4: The bad readability of the earlier version still itched me, so I have now split my implementation into multiple functions.

like image 93
Rörd Avatar answered Apr 07 '23 05:04

Rörd


You can use a data structure such as sorted-set which it is currently implemented as clojure.lang.PersistentTreeSet. This way you can avoid the sorting before you get your top n elements (I would say).

(-> (sorted-set-by >) 
    (conj 90) 
    (conj 10) 
    (conj 1))

#{90 10 1}

Now you can call split-at function:

(split-at n previous-sorted-set)

But that depends on whether you want/can use a sorted set.

like image 31
Chiron Avatar answered Apr 07 '23 04:04

Chiron