 

Efficiently selecting a set of random elements from a linked list

Say I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.

How can I most efficiently write a function that will return k completely random numbers from the list?

asked Sep 10 '08 by Matt Sheppard


1 Answer

There's a very nice and efficient algorithm for this using a method called reservoir sampling.

Let me start by giving you its history:

Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.

McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.

Vitter 1985 (2) reviews Algorithm R and then presents three additional algorithms which produce the same output, but with a twist. Rather than deciding whether to include or skip each incoming element, his algorithms predetermine the number of incoming elements to skip. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each incoming number.
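
As a rough illustration of the skip-count idea, here is a short sketch in Python. It is not Vitter's exact algorithms; it follows a later, simplified skip formulation, and the function name, the use of a Python iterator, and the choice of Python itself are assumptions made for this example. Algorithm R itself, which the rest of this answer uses, is given in pseudocode just below.

import math
import random

def reservoir_sample_with_skips(iterable, k):
    """Reservoir sampling that draws how many elements to skip before the
    next replacement, instead of testing every incoming element
    (a sketch of the skip-count idea, not Vitter's exact algorithm)."""
    it = iter(iterable)
    reservoir = []
    # Fill the reservoir with the first k elements.
    for _ in range(k):
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir
    # random.random() is treated here as uniform on (0, 1).
    w = math.exp(math.log(random.random()) / k)
    while True:
        # Number of upcoming elements to jump over before the next replacement.
        skip = int(math.log(random.random()) / math.log(1.0 - w))
        try:
            for _ in range(skip):
                next(it)
            item = next(it)
        except StopIteration:
            return reservoir
        # Replace a uniformly chosen reservoir slot with the element we landed on.
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)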

In pseudocode the algorithm is:

Let R be the result array of size s
Let I be an input queue

> Fill the reservoir array
for j in the range [1,s]:
  R[j] = I.pop()

elements_seen = s
while I is not empty:
  elements_seen += 1
  j = random(1, elements_seen)   > This is inclusive
  if j <= s:
    R[j] = I.pop()
  else:
    I.pop()

Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.
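
To make this concrete, here is a runnable sketch in Python that applies the pseudocode above to the singly linked list from the question (k plays the role of s). The Node class and the function name are assumptions made for this example; your list structure may differ.

import random

class Node:
    """Minimal singly linked list node, assumed only for this example."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reservoir_sample(head, k):
    """Return k uniformly random values from the list starting at head.

    Makes one pass over the list, needs no advance knowledge of its
    length, and uses O(k) extra space for the reservoir."""
    reservoir = []
    node = head
    seen = 0
    while node is not None:
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k elements.
            reservoir.append(node.value)
        else:
            # With probability k/seen, replace a uniformly chosen slot.
            j = random.randrange(seen)   # uniform over 0..seen-1
            if j < k:
                reservoir[j] = node.value
        node = node.next
    return reservoir   # shorter than k if the list has fewer than k nodes

# Example usage: sample 3 values from a 10-node list.
head = None
for v in range(10, 0, -1):
    head = Node(v, head)
print(reservoir_sample(head, 3))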

Why does this work?

McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be a bit difficult to reconstruct it here. Therefore, I've generated an alternative proof which is easier to explain.

We proceed via proof by induction.

Say we want to generate a set of s elements and that we have already seen n>s elements.

Let's assume that our current s elements have already each been chosen with probability s/n.

By the definition of the algorithm, we choose element n+1 with probability s/(n+1).

If element n+1 is chosen, each element already part of our result set is equally likely to be the one replaced, that is, with probability 1/s.

The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).

Thus, the n+1-seen result set contains a given element if either it was part of the n-seen result set and was not replaced, which happens with probability (s/n)*n/(n+1) = s/(n+1), or it is element n+1 and was chosen, which happens with probability s/(n+1). In both cases the probability is s/(n+1), completing the induction step.

The definition of the algorithm tells us that the first s elements are automatically included as the first n = s members of the result set. Therefore, the n-seen result set includes each element with probability s/n = 1, giving us the necessary base case for the induction.
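
As a quick numeric check of the induction step: with s = 2, the base case puts each of the first two elements in the reservoir with probability 2/2 = 1. When element 3 arrives it is chosen with probability 2/3, each existing element is replaced with probability 1/3 and so survives with probability 2/3, and therefore every one of the three elements ends up in the result with probability 2/3 = s/(n+1). The same can be checked empirically; the short simulation below is a sketch that runs the algorithm above on a plain Python range rather than a linked list and counts how often each element is selected.

import random
from collections import Counter

def sample_k(items, k):
    # Algorithm R, as in the pseudocode above.
    reservoir = []
    for seen, x in enumerate(items, start=1):
        if len(reservoir) < k:
            reservoir.append(x)
        else:
            j = random.randrange(seen)   # uniform over 0..seen-1
            if j < k:
                reservoir[j] = x
    return reservoir

trials, k, n = 100_000, 3, 10
counts = Counter()
for _ in range(trials):
    counts.update(sample_k(range(n), k))

# Each of the n elements should appear in about k/n of the trials.
expected = trials * k / n
for value in range(n):
    print(value, counts[value], round(counts[value] / expected, 2))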

References

  1. McLeod, A. Ian, and David R. Bellhouse. "A convenient algorithm for drawing a simple random sample." Journal of the Royal Statistical Society. Series C (Applied Statistics) 32.2 (1983): 182-184.

  2. Vitter, Jeffrey S. "Random sampling with a reservoir." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.

answered Sep 28 '22 by Richard