What is the O() value for naive random selection from a finite set?

This question on getting random values from a finite set got me thinking...

It's fairly common for people to want to retrieve X unique values from a set of Y values. For example, I may want to deal a hand from a deck of cards. I want 5 cards, and I want them to all be unique.

Now, I can do this naively by picking a random card 5 times, trying again each time I get a duplicate, until I have 5 unique cards. This isn't so great, however, for large numbers of values from large sets. If I wanted 999,999 values from a set of 1,000,000, for instance, this method gets very bad.
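For concreteness, here is a minimal Python sketch of that naive approach (the function name and return shape are my own, not from the question):

    import random

    def naive_pick(n, m):
        """Pick m unique values from range(n) by redrawing on duplicates.
        Assumes m <= n; otherwise the loop never finishes."""
        chosen = set()
        draws = 0
        while len(chosen) < m:
            draws += 1
            chosen.add(random.randrange(n))  # a duplicate leaves the set unchanged
        return chosen, draws

    # Example: deal a 5-card hand from a 52-card deck.
    hand, draws = naive_pick(52, 5)
    print(sorted(hand), draws)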

The question is: how bad? I'm looking for someone to explain an O() value. Getting the xth number will take y attempts...but how many? I know how to figure this out for any given value, but is there a straightforward way to generalize this for the whole series and get an O() value?

(The question is not: "how can I improve this?" because it's relatively easy to fix, and I'm sure it's been covered many times elsewhere.)

asked by Beska, Aug 18 '09

2 Answers

Variables

n = the total number of items in the set
m = the number of unique values to be retrieved from the set of n items
d(i) = the expected number of tries needed to obtain a new unique value in step i
i = denotes one specific step; i ∈ [0, m-1]
T(m,n) = the expected total number of tries for selecting m unique items from a set of n items using the naive algorithm

Reasoning

The first step, i=0, is trivial. No matter which value we choose, we get a unique one at the first attempt. Hence:

d(0) = 1

In the second step, i=1, we need at least 1 try (the try in which we pick a new, unique value). On top of this, there is a chance that we pick an already-chosen value. This chance is (number of previously picked items)/(total number of items), in this case 1/n. If we do pick the wrong item, there is again a 1/n chance of picking the wrong item on the following try; the combined probability of picking wrong twice in a row is (1/n)^2. To understand this, it is helpful to draw a decision tree. Having picked a non-unique item twice, there is again a probability that we will do it a third time, which adds (1/n)^3 to the expected number of tries in step i=1, and so on: each time we pick the wrong number, there is a chance we might pick the wrong number again. This results in:

d(1) = 1 + 1/n + (1/n)^2 + (1/n)^3 + (1/n)^4 + ...

Similarly, in the general i-th step, the chance of picking an already-chosen item on any single try is i/n, resulting in:

d(i) = 1 + i/n + (i/n)^2 + (i/n)^3 + (i/n)^4 + ... =
= sum( (i/n)^k ), where k ∈ [0, ∞)

This is a geometric series, and hence its sum is easy to compute:

d(i) = (1 - i/n)^(-1)
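As a quick sanity check of this closed form, here is a small Monte Carlo sketch (function name mine) that estimates d(i) by treating the values 0..i-1 as already chosen:

    import random

    def estimate_d(i, n, trials=100_000):
        """Estimate d(i): the average number of draws until we hit one of
        the n - i values that have not been chosen yet."""
        total = 0
        for _ in range(trials):
            while random.randrange(n) < i:  # values 0..i-1 stand for already-chosen items
                total += 1                  # a failed draw
            total += 1                      # the final, successful draw
        return total / trials

    print(estimate_d(30, 52))  # should be close to (1 - 30/52)^(-1) ~ 2.36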

The total expected number of tries is then computed by summing the expected number of tries in each step:

T(m,n) = sum( d(i) ), where i ∈ [0, m-1] =
= 1 + (1 - 1/n)^(-1) + (1 - 2/n)^(-1) + (1 - 3/n)^(-1) + ... + (1 - (m-1)/n)^(-1)

Multiplying the numerator and denominator of each fraction in the series above by n, we get:

T(m,n) = n/n + n/(n-1) + n/(n-2) + n/(n-3) + ... + n/(n-m+2) + n/(n-m+1)

We can use the fact that:

n/n ≤ n/(n-1) ≤ n/(n-2) ≤ n/(n-3) ≤ ... ≤ n/(n-m+2) ≤ n/(n-m+1)

Since the series has m terms, and each term satisfies the inequality above, we get:

T(m,n) ≤ n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + ... + n/(n-m+1) + n/(n-m+1) =
= m*n/(n-m+1)

It might be (and probably is) possible to establish a slightly tighter upper bound by evaluating the series more carefully, instead of using the rough bound of (number of terms) * (largest term).

Conclusion

This would mean that the Big-O order is O(m*n/(n-m+1)). I see no way to simplify this expression further.
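To get a feel for how tight this bound is, one can compare the exact sum with m*n/(n-m+1); a small Python sketch (function names mine):

    def T_exact(m, n):
        """Exact expected total draws: sum of n/(n-i) for i in [0, m-1]."""
        return sum(n / (n - i) for i in range(m))

    def T_bound(m, n):
        """The upper bound derived above: m terms, each at most n/(n-m+1)."""
        return m * n / (n - m + 1)

    for m, n in [(5, 52), (500_000, 10**6), (999_999, 10**6)]:
        print(m, n, round(T_exact(m, n)), round(T_bound(m, n)))

The bound is close for small m but becomes loose as m approaches n: for m = 999,999 and n = 1,000,000 the exact expectation is about 1.3 * 10^7 draws, while the bound is about 5 * 10^11.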

Looking back at the result to check that it makes sense: if n is held constant and m gets closer and closer to n, the result increases quickly, since the denominator gets very small. This is what we'd expect, considering the example from the question of selecting "999,999 values from a set of 1,000,000". If we instead hold m constant and let n grow very large, the complexity converges towards O(m) in the limit n → ∞. This is also what we'd expect, since when choosing a constant number of items from a near-infinitely-sized set, the probability of choosing a previously chosen value is essentially 0. That is, we need m tries independently of n, since there are no collisions.
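Simulating the full algorithm supports this limiting behaviour; a quick sketch (assuming the naive algorithm exactly as described in the question):

    import random

    def average_total_draws(m, n, trials=2_000):
        """Average total draws of the naive algorithm over many runs."""
        total = 0
        for _ in range(trials):
            chosen = set()
            while len(chosen) < m:
                total += 1
                chosen.add(random.randrange(n))
        return total / trials

    # m fixed at 5; as n grows, the average approaches m = 5.
    for n in (10, 100, 10_000):
        print(n, average_total_draws(5, n))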

answered by Alderath


If you have already chosen i values, then the probability of picking a new one from a set of y values is

(y-i)/y.

Hence the expected number of trials to get the (i+1)-th element is

y/(y-i).

Thus the expected number of trials to choose x unique elements is the sum

 y/y + y/(y-1) + ... + y/(y-x+1)

This can be expressed using harmonic numbers as

y (H_y - H_(y-x)).

From the Wikipedia page on harmonic numbers you get the approximation

H_x = ln(x) + γ + O(1/x)

Hence the expected number of trials needed to pick x unique elements from a set of y elements is

y (ln(y) - ln(y-x)) + O(y/(y-x)).
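A quick numeric comparison of the exact sum against this logarithmic approximation (function names are mine):

    import math

    def trials_exact(x, y):
        """Exact expected trials: y * (H_y - H_(y-x))."""
        return sum(y / (y - i) for i in range(x))

    def trials_approx(x, y):
        """The approximation above: y * (ln(y) - ln(y-x)); requires x < y."""
        return y * (math.log(y) - math.log(y - x))

    print(trials_exact(5, 52), trials_approx(5, 52))                    # ~5.20 vs ~5.26
    print(trials_exact(999_999, 10**6), trials_approx(999_999, 10**6))  # ~1.34e7 vs ~1.38e7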

If needed, you can get a more precise approximation by using a more precise expansion of H_x. In particular, when x is small it is possible to improve the result a lot.

answered by Accipitridae