 

Does Repeating a Biased Random Shuffle Reduce the Bias?

I'd like to produce fast random shuffles repeatedly with minimal bias.

It's known that the Fisher-Yates shuffle is unbiased as long as the underlying random number generator (RNG) is unbiased.

To shuffle an array a of n elements:
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ j ≤ i
       exchange a[j] and a[i]
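
In Python, that pseudocode translates roughly to the following (a minimal sketch; the standard random module stands in for whatever fast RNG I would actually use):

    import random

    def fisher_yates_shuffle(a, rng=random):
        """Shuffle the list a in place using the Fisher-Yates algorithm."""
        for i in range(len(a) - 1, 0, -1):
            j = rng.randint(0, i)      # random integer with 0 <= j <= i
            a[i], a[j] = a[j], a[i]    # exchange a[j] and a[i]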

But what if the RNG is biased (but fast)?

Suppose I want to produce many random permutations of an array of 25 elements. If I use the Fisher-Yates algorithm with a biased RNG, then my permutations will be biased, but I believe this assumes that the 25-element array starts from the same state before each application of the shuffle. One problem, for example, is that if the RNG only has a period of 2^32 ≈ 4×10^9, it cannot produce every possible permutation of the 25 elements, because there are 25! ≈ 1.6×10^25 permutations.

My general question is: if I leave the array in its shuffled state (rather than resetting it) before each new application of the Fisher-Yates shuffle, would this reduce the bias and/or allow the algorithm to produce every permutation?

My guess is that it would generally produce better results, but it seems that if the number of elements in the repeatedly shuffled array were related in some way to the underlying RNG, the permutations could actually repeat more often than expected.

Does anyone know of any research that addresses this?

As a sub-question, what if I only want repeated permutations of 5 of the 25 elements in the array, so I use the Fisher-Yates algorithm to select 5 elements and stop before doing a full shuffle? (I use the 5 elements on the end of the array that got swapped.) Then I start over using the previous partially shuffled 25-element array to select another permutation of 5. Again, it seems like this would be better than starting from the original 25-element array if the underlying RNG had a bias. Any thoughts on this?
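
To make the partial-shuffle idea concrete, here is a rough Python sketch of what I mean (the names are mine; the point is that the deck is reused between draws instead of being reset):

    import random

    def draw_five(a, k=5, rng=random):
        """Run only the last k steps of Fisher-Yates on a (in place) and
        return the k elements that end up at the tail of the array."""
        n = len(a)
        for i in range(n - 1, n - 1 - k, -1):
            j = rng.randint(0, i)
            a[i], a[j] = a[j], a[i]
        return a[n - k:]

    deck = list(range(25))
    first = draw_five(deck)    # deck is left partially shuffled ...
    second = draw_five(deck)   # ... and reused as-is for the next draw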

I think it would be easier to test the partial-shuffle case, since there are only 6,375,600 possible permutations of 5 out of 25 elements; are there any simple tests I can use to check for bias?
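
For example, a chi-squared goodness-of-fit test over all 6,375,600 cells seems feasible. A sketch, assuming SciPy is available and using draw_five from the sketch above as the shuffle under test:

    from collections import Counter
    from math import factorial
    from scipy.stats import chisquare   # assumes SciPy is available

    def chi_square_bias_test(draw, n=25, k=5, samples=50_000_000):
        """Chi-squared goodness-of-fit over all P(n, k) = n! / (n - k)!
        possible k-permutations; draw() must return one permutation."""
        cells = factorial(n) // factorial(n - k)     # 6,375,600 for n=25, k=5
        counts = Counter(tuple(draw()) for _ in range(samples))
        observed = list(counts.values()) + [0] * (cells - len(counts))
        return chisquare(observed)                   # uniform expected counts

    deck = list(range(25))
    stat, p_value = chi_square_bias_test(lambda: draw_five(deck))

With around 50 million samples the expected count per cell is roughly 8, which is above the usual minimum recommended for a chi-squared test; a very small p-value would point to bias.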

asked Sep 29 '10 by JohnPS

3 Answers

if the RNG only has a period of 2^32 ~ 10^9 we can not produce every possible permutation of the 25 elements because this is 25! ~ 10^25 permutations

This is only true if the seed alone determines every successive selection. As long as your RNG can be expected to deliver a precisely even distribution over the range specified for each selection, it can produce every permutation. If your RNG cannot do that, having a larger seed base will not help.

As for your side question, you might as well reseed for every draw. However, reseeding the generator is only useful if the new seed contains enough entropy; timestamps don't contain much entropy, and neither do algorithmic calculations.

I'm not sure what larger problem this is part of, because you haven't described it, but if you are trying to calculate something over a larger domain using random input, there are probably better methods.

answered by Nick Larsen


A couple of points:

1) Anyone using the Fisher-Yates shuffle should read this and make doubly sure their implementation is correct.
2) Doesn't repeating the shuffle defeat the purpose of using a faster random number generator? Surely if you're going to have to repeat every shuffle 5 times to get the desired entropy, you're better off using a low-bias generator.
3) Do you have a setup where you can test this? If so, start trying things: Jeff's graphs make it clear that you can easily detect quite a lot of errors by using small decks and visually portraying the results.
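
For instance, a minimal sketch of such a test in Python (the names are mine): count how often each permutation of a tiny deck comes up and check that the counts are roughly equal.

    from collections import Counter
    from itertools import permutations
    import random

    def permutation_counts(shuffle_fn, deck_size=3, trials=600_000):
        """Count how often each permutation of a small deck appears;
        an unbiased shuffle gives roughly equal counts."""
        counts = Counter()
        for _ in range(trials):
            deck = list(range(deck_size))
            shuffle_fn(deck)
            counts[tuple(deck)] += 1
        for p in permutations(range(deck_size)):
            print(p, counts[p])

    permutation_counts(random.shuffle)   # swap in the shuffle under test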

answered by Daniel


My feeling is that with a biased RNG repeated runs of the Knuth shuffle would produce all the permutations, but I'm not able to prove it (it depends on the period of the RNG and on how biased it is).

So let's reverse the question: given an algorithm that requires a random input and a biased RNG, is it easier to de-skew the algorithm's output or to de-skew the RNG's output?

Unsurprisingly, the latter is much easier to do (and is of broader interest): there are several standard techniques to do it. A simple technique, due to Von Neumann, is: given a bitstream from a biased RNG, take bits in pairs, throw away every (0,0) and (1,1) pair, return a 1 for every (1,0) pair and a 0 for every (0,1) pair. This technique assumes that the bits are from a stream where each bit has the same probability of being a 0 or 1 as any other bit in the stream and that bits are not correlated. Elias generalized von Neumann's technique to a more efficient scheme (one where fewer bits are discarded).
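
A minimal sketch of the von Neumann extractor in Python (the names are mine):

    def von_neumann(bits):
        """Consume bits in pairs: drop (0,0) and (1,1), emit 1 for (1,0)
        and 0 for (0,1), i.e. output the first bit of each unequal pair."""
        it = iter(bits)
        for a, b in zip(it, it):
            if a != b:
                yield a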

But even strongly biased or correlated bits may contain useful amounts of randomness, which can be extracted, for example, using a technique based on the Fast Fourier Transform.

Another option is to feed the biased RNG output to a cryptographically strong function, for example a message digest algorithm, and use its output.
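
For example, a sketch with SHA-256 standing in for whatever digest you prefer:

    import hashlib

    def whiten(raw_block):
        """Run a block of biased RNG output through a cryptographic digest
        and use the digest bytes as the de-skewed random output."""
        return hashlib.sha256(raw_block).digest()

Note that each 32-byte digest can carry at most as much entropy as the biased block fed into it, so you still need to feed in enough raw input per block.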

For further references on how to de-skew random number generators, I suggest reading the Randomness Recommendations for Security RFC.

My point is that the quality of the output of a randomized algorithm is upper bounded by the entropy provided by the RNG: if the RNG is extremely biased, the output will be extremely biased, no matter what you do. The algorithm can't squeeze out more entropy than is contained in the biased random bitstream. Worse: it will probably lose some random bits. Even assuming that the algorithm works with a biased RNG, to obtain good results you'll have to expend computational effort at least as great as the effort it would take to de-skew the RNG (and probably more, since you'll have to both run the algorithm and "defeat" the bias at the same time).

If your question is just theoretical, then please disregard this answer. If it is practical, then please seriously consider de-skewing your RNG instead of making assumptions about the output of the algorithm.

answered by Giuseppe Cardone