Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clojure - counting unique values from vectors in a seq

Tags:

clojure

Being somewhat new to Clojure I can't seem to figure out how to do something that seems like it should be simple. I just can't see it. I have a seq of vectors. Let's say each vector has two values representing customer number and invoice number and each of the vectors represents a sale of an item. So it would look something like this:

([ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ])

I want to count the number of unique customers and unique invoices. So the example should produce the vector

[ 2 3 ]

In Java or another imperative language I would loop over each one of the vectors in the seq, add the customer number and invoice number to a set then count the number of values in each set and return it. I can't see the functional way to do this.

Thanks for the help.

EDIT: I should have specified in my original question that the seq of vectors is in the 10's of millions and actually has more than just two values. So I want to only go through the seq one time and calculate these unique counts (and some sums as well) on that one run through the seq.

like image 200
Dave Kincaid Avatar asked Jul 11 '12 12:07

Dave Kincaid


3 Answers

In Clojure you can do it almost the same way - first call distinct to get unique values and then use count to count results:

(def vectors (list [ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ]))
(defn count-unique [coll] 
   (count (distinct coll)))
(def result [(count-unique (map first vectors)) (count-unique (map second vectors))])

Note that here you first get list of first and second elements of vectors (map first/second vectors) and then operate on each separately, thus iterating over collection twice. If performance does matter, you may do same thing with iteration (see loop form or tail recursion) and sets, just like you would do in Java. To further improve performance you can also use transients. Though for beginner like you I would recommend first way with distinct.

UPD. Here's version with loop:

(defn count-unique-vec [coll]
  (loop [coll coll, e1 (transient #{}), e2 (transient #{})]
    (cond (empty? coll) [(count (persistent! e1)) (count (persistent! e2))]
          :else (recur (rest coll)
                       (conj! e1 (first (first coll)))
                       (conj! e2 (second (first coll)))))))
(count-unique-vec vectors)    ==> [2 3]

As you can see, no need in atoms or something like that. First, you pass state to each next iteration (recur call). Second, you use transients to use temporary mutable collections (read more on transients for details) and thus avoid creation of new object each time.

UPD2. Here's version with reduce for extended question (with price):

(defn count-with-price
  "Takes input of form ([customer invoice price] [customer invoice price] ...)  
   and produces vector of 3 elements, where 1st and 2nd are counts of unique    
   customers and invoices and 3rd is total sum of all prices"
  [coll]
  (let [[custs invs total]
        (reduce (fn [[custs invs total] [cust inv price]]
                  [(conj! custs cust) (conj! invs inv) (+ total price)])
            [(transient #{}) (transient #{}) 0]
            coll)]
    [(count (persistent! custs)) (count (persistent! invs)) total]))

Here we hold intermediate results in a vector [custs invs total], unpack, process and pack them back to a vector each time. As you can see, implementing such nontrivial logic with reduce is harder (both to write and read) and requires even more code (in looped version it is enough to add one more parameter for price to loop args). So I agree with @ammaloy that for simpler cases reduce is better, but more complex things require more low-level constructs, such as loop/recur pair.

like image 74
ffriend Avatar answered Nov 17 '22 16:11

ffriend


As is often the case when consuming a sequence, reduce is nicer than loop here. You can just do:

(map count (reduce (partial map conj) 
                   [#{} #{}]
                   txn))

Or, if you're really into transients:

(map (comp count persistent!)
     (reduce (partial map conj!) 
             (repeatedly 2 #(transient #{}))
             txn))

Both of these solutions traverse the input only once, and they take much less code than the loop/recur solution.

like image 10
amalloy Avatar answered Nov 17 '22 16:11

amalloy


Or you could use sets to handle the de-duping for you since sets can have a max of one of any specific value.

(def vectors '([100 2000] [100 2000] [101 2001] [100 2002]))    
[(count (into #{} (map first vectors)))  (count (into #{} (map second vectors)))]
like image 4
Base Avatar answered Nov 17 '22 16:11

Base