I observed the following unexpected behaviour when using ScalaCheck's Gen.pic, which (for me) indicates that its picking is not quite random, even though its documentation says so:
/** A generator that picks a given number of elements from a list, randomly */
I ran the below three little programs in order (in a span of 2 days, at different times, as it might matter) after setting
implicit override val generatorDrivenConfig = PropertyCheckConfig(
maxSize = 1000,
minSize = 1000,
minSuccessful = 1000)
to get a decent sample size.
Program #1
val set = Set(1,2,3,4,5,6,7,8,9,10,
11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,
41,42,43,44,45,46,47,48,49,50)
// Thanks to @Jubobs for the solution
// See: http://stackoverflow.com/a/43825913/4169924
val g = Gen.pick(3, set).map { _.toList }
forAll (g) { s => println(s) }
Out of the 3000 numbers generated at 2 different runs I got a surprisingly similar, and quite non-random distribution (numbers are rounded, only top 5 listed, as for all listing from here on):
(Disclaimer: I couldn't find how to create a table here other then this way)
Program 2
val list: List[Int] = List.range(1, 50)
val g = Gen.pick(3, list)
forAll (g) { s => println(s) }
In case of using a List
, the numbers seem to get "stuck" at the end of the range (3x1000 numbers in case of both runs):
Interestingly, the frequencies are pretty much the same as in the case of Program 1.
Remark: I repeated the runs for lists up to 10 times, and experienced the very same distribution with +/- 1% differences, just didn't want to list all the numbers here in this strange "table" format.
Program 3
Just to spice up things a bit, I ran a third little snippet, creating the Set
(Program 1) from a List
(Program 2):
val set: Set[Int] = List.range(1, 50).toSet
val g = Gen.pick(3, set).map { _.toList }
forAll (g) { s => println(s) }
Now the numbers are the same as for Program 2 (List
wins!), although the frequencies (again, for 3*1000 numbers in 2 runs) got slightly different at the end:
Question
Even though the sample size is not enough (as it is never enough) to tell true randomness, I can't help but question Gen.pick
's claimed randomness (as far as using it out-of-the-box, I might need to set some seed for it to work "more" randomly), since numbers got "stuck", and frequencies are almost the same.
Upon looking at Gen.pick
's source code, at line #672 a certain seed0
is used:
def pick[T](n: Int, l: Iterable[T]): Gen[Seq[T]] = {
if (n > l.size || n < 0) throw new IllegalArgumentException(s"invalid choice: $n")
else if (n == 0) Gen.const(Nil)
else gen { (p, seed0) =>
// ...
which I can't find defined anywhere else (in Gen.scala source code, or in scala.util.Random documentation), but I have a hunch it might have something to do with the observed behaviour.
Is this expected behaviour of Gen.pick
? If so, how can I get "more" random picking?
Although @ashawley answer is already accepted I don't think it is correct. I think that this is actually a bug and it was introduced by erik-stripe's commit on Sep 1, 2016 and the bug is actually in the line
val i = (x & 0x7fffffff).toInt % n
and it was supposed to be
val i = (x & 0x7fffffff).toInt % count
which is still not quite correct.
I also expect that your 33% for the last value is actually 100% and you didn't take into account the fact that you select 3 elements so all your statistics should be multiplied by 3. So for 3-element selection the last element is selected 100% of the time, the previous one - 66.6% and so on, which is even worse than you expected.
Here is again an excerpt from the code:
else gen { (p, seed0) =>
val buf = ArrayBuffer.empty[T]
val it = l.iterator
var seed = seed0
var count = 0
while (it.hasNext) {
val t = it.next
count += 1
if (count <= n) {
buf += t
} else {
val (x, s) = seed.long
val i = (x & 0x7fffffff).toInt % n
if (i < n) buf(i) = t
seed = s
}
}
r(Some(buf), seed)
}
So what this code is supposed to do and what it actually does? The if (count <= n)
branch fills the output buf
with first n
elements and after that always the else
branch works. To make it more clear I changed the while
moving if
outside to the following code:
for (i <- 0 until n) {
val t = it.next
buf += t
}
while (it.hasNext) {
val t = it.next
val (x, s) = seed.long
val i = (x & 0x7fffffff).toInt % n
if (i < n) buf(i) = t
seed = s
}
So now it becomes obvious that the else
branch should at the same time decide if the current element should be added to the output buf
AND which element there it should replace. Obviously, current code always selects each element because if (i < n)
is always true given that i
is calculated as something % n
. And this is why you see such a huge skew to the last elements.
Obviously the plan was to use a modified version of Fisher–Yates shuffle that selects only first n
elements of the shuffle and to do it properly you need to select random numbers in the range [0, count
) and this is probably why the code is written the way it is written i.e. preserving counter
inside while
loop.
Using % count
is still not quite correct because such a simple way doesn't produce uniform distribution when count
is not a power of 2. To be more fair something like
val c0 = choose(0, count-1)
val rt: R[Int] = c0.doApply(p, seed)
seed = rt.seed
val i = rt.retrieve.get // index to swap current element with. Should be fair random number in range [0, count-1], see Fisher–Yates shuffle
if (i < n) buf(i) = t
or some other way to create i
as a fair uniformly distributed random number in such a range should be used.
Update (Why just % count
is wrong)
You can look at java.util.Random.nextInt(int) implementation or org.scalacheck.Choose.chLng for an example of how it should be done. It is more complicated than just % count
and there is a good reason for it. To illustrate it consider following example. Let's assume that your source random generator generates uniformly random 3-bit values i.e in range just [0, 7] and you want to get ranadom number in range [0, 2]
and you do it by simply doing
srcGenerator.nextInt() % 3
Now consider mapping of values in range [0, 7]
to your range [0, 2]
:
0, 3, 6
will be mapped to 0
(i.e. 3 values are mapped)1, 4, 7
will be mapped to 1
(i.e. 3 values are mapped)2, 5
will be mapped to 2
(i.e. only 2 values are mapped)So if you do just % 3
your distribution will be 0 - 3/8, 1 - 3/8, 2 - 2/8 which is obviously not uniform. This is why those implementations I referenced earlier use some kind of a loop and discard some values generated by the source generator. It is required to produce unifrom distribution.
I don't think it has anything to do with the seed. It has everything to do with Scalacheck's heurisitic.
There is a subtle bug. Consider what it's doing. It is forcing itself to choose values at the beginning, and than randomly overwriting them after that:
while (it.hasNext) {
val t = it.next
count += 1
if (count <= n) {
buf += t
} else {
val (x, s) = seed.long
val i = (x & 0x7fffffff).toInt % n
if (i < n) buf(i) = t
seed = s
}
...
It is randomly assigning those elements to the result in the else
-block, that's why it is giving preference to the tail values.
So, pick
is randomly selecting values from a set. However, it has sacrificed picking equally among the values and favors the end of the list because the code is trying to iterate through the list lazily.
To attempt to get an even distribution of picked elements, you would need to know the length of the collection, but as my answer suggests, that's not possible without consuming the iterable twice.
Maybe if you ran reverse
on your list, or shuffle
, you'd get a better distribution of picks with pick
.
Because Scalacheck is a generic property testing library, I predict it can't do either of these things without sacrificing performance for arbitrary-sized collections.
But as Alexey Romanov points out, this should implement the reservoir sampling algorithm which avoids knowing the length and can be run in O(n) time. There is simply a defect in the code. The fix is simply correcting the line for random number generation. It should get a random number from 1 to the kth (count
) element visited in the list.
val i = (x & 0x7fffffff).toInt % n
Should be:
val i = (x & 0x7fffffff).toInt % count
I've submitted a PR to Scalacheck:
https://github.com/rickynils/scalacheck/pull/333
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With