Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Complexity of Clojure's distinct + randomly generated stream

What is the time complexity of an expression

(doall (take n (distinct stream)))

where stream is a lazily generated (possibly infinite) collection with duplicates?

I guess this partially depends on the amount or chance of duplicates in stream? What if stream is (repeatedly #(rand-int m))) where m >= n?

My estimation:

For every element in the resulting list there has to be at least one element realized from the stream. Multiple if the stream has duplicates. For every iteration there is a set lookup and/or insert, but since those are near constant time we get at least: O(n*~1) = O(n) and then some complexity for the duplicates. My intuition is that the complexity for the duplicates can be neglected too, but I'm not sure how to formalize this. For example, we cannot just say it is O(n*k*~1) = O(n) for some constant k since there is not an obvious maximum number k of duplicates we could encounter in the stream.

Let me demonstrate the problem with some data:

(defn stream [upper distinct-n]
  (let [counter (volatile! 0)]
    (doall (take distinct-n
                 (distinct
                  (repeatedly (fn []
                                (vswap! counter inc)
                                (rand-int upper))))))
    @counter))

(defn sample [times-n upper distinct-n]
  (->> (repeatedly times-n
                  #(stream upper distinct-n))
       frequencies
       (sort-by val)
       reverse))

(sample 10000 5 1) ;; ([1 10000])
(sample 10000 5 2) ;; ([2 8024] [3 1562] [4 334] [5 66] [6 12] [8 1] [7 1])
(sample 10000 5 3) ;; ([3 4799] [4 2898] [5 1324] [6 578] [7 236] [8 87] [9 48] [10 14] [11 10] [14 3] [12 2] [13 1])
(sample 10000 5 3) ;; ([3 4881] [4 2787] [5 1359] [6 582] [7 221] [8 107] [9 39] [10 12] [11 9] [12 1] [17 1] [13 1])
(sample 10000 5 4) ;; ([5 2258] [6 1912] [4 1909] [7 1420] [8 985] [9 565] [10 374] [11 226] [12 138] [13 89] [14 50] [15 33] [16 16] [17 9] [18 8] [20 5] [19 1] [23 1] [21 1])
(sample 10000 5 5) ;; ([8 1082] [9 1055] [7 1012] [10 952] [11 805] [6 778] [12 689] [13 558] [14 505] [5 415] [15 387] [16 338] [17 295] [18 203] [19 198] [20 148] [21 100] [22 96] [23 72] [24 53] [25 44] [26 40] [28 35] [27 31] [29 19] [30 16] [31 15] [32 13] [35 10] [34 6] [33 6] [42 3] [38 3] [45 3] [36 3] [37 2] [39 2] [52 1] [66 1] [51 1] [44 1] [41 1] [50 1] [60 1] [58 1])

Note that for the last sample the number of iterations distinct can go up to 66, although the chance is small. Also notice that for increasing n in (sample 10000 n n) the most likely number of realized elements from the stream seems to go up more than linearly.

This chart illustrates the number of realized elements from the input (most common occurance from 10000 samples) in (doall (take n (repeatedly #(rand-int m))) for various numbers of n and m.

data

For completeness, here is the code I used to generate the chart:

(require '[com.hypirion.clj-xchart :as c])

(defn most-common [times-n upper distinct-n]
  (->> (repeatedly times-n
                   #(stream upper distinct-n))
       frequencies
       (sort-by #(- (val %)))
       ffirst))

(defn series [m]
  {(str "m = " m)
   (let [x (range 1 (inc m))]
     {:x x
      :y (map #(most-common 10000 m %)
              x)})})

(c/view
 (c/xy-chart
  (merge (series 10)
         (series 25)
         (series 50)
         (series 100))
  {:x-axis {:title "n"}
   :y-axis {:title "realized"}}))
like image 998
Michiel Borkent Avatar asked Oct 30 '22 16:10

Michiel Borkent


1 Answers

Your problem is known as the Coupon collectors problem and the expected number of elements is given by just summing up m/m + m/(m-1) ... until you have your n items:

(defn general-coupon-collector-expect
  "n: Cardinality of resulting set (# of uniuque coupons to collect)
   m: Cardinality of set to draw from (#coupons that exist)"
  [n m]
  {:pre [(<= n m)]}
  (double (apply + (mapv / (repeat m) (range m (- m n) -1)))))
(general-coupon-collector-expect 25 25)
;; => 95
;; This generates the data for you plot:
(for [x (range 10 101 5)]
  (general-coupon-collector-expect x 100))

Worst case will be infinite. Best case will be just N. Average case is O(N log N). This ignores the complexity of checking if an element has already been drawn. In practice it is Log_32 N for clojure sets (which is used in distinct).

like image 85
ClojureMostly Avatar answered Dec 14 '22 21:12

ClojureMostly