Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is ScalaCheck's Gen.pick really random?

I observed the following unexpected behaviour when using ScalaCheck's Gen.pic, which (for me) indicates that its picking is not quite random, even though its documentation says so:

/** A generator that picks a given number of elements from a list, randomly */

I ran the below three little programs in order (in a span of 2 days, at different times, as it might matter) after setting

implicit override val generatorDrivenConfig = PropertyCheckConfig(
  maxSize = 1000, 
  minSize = 1000, 
  minSuccessful = 1000)

to get a decent sample size.

Program #1

val set = Set(1,2,3,4,5,6,7,8,9,10,
      11,12,13,14,15,16,17,18,19,20,
      21,22,23,24,25,26,27,28,29,30,
      31,32,33,34,35,36,37,38,39,40,
      41,42,43,44,45,46,47,48,49,50)

// Thanks to @Jubobs for the solution
// See: http://stackoverflow.com/a/43825913/4169924
val g = Gen.pick(3, set).map { _.toList }
forAll (g) { s => println(s) }

Out of the 3000 numbers generated at 2 different runs I got a surprisingly similar, and quite non-random distribution (numbers are rounded, only top 5 listed, as for all listing from here on):

  • Number: frequency in run #1, frequency in run #2
  • 15: 33%, 33%
  • 47: 22%, 22%
  • 4: 15%, 16%
  • 19: 10%, 10%
  • 30: 6%, 6%

(Disclaimer: I couldn't find how to create a table here other then this way)

Program 2

val list: List[Int] = List.range(1, 50)
val g = Gen.pick(3, list)
forAll (g) { s => println(s) }

In case of using a List, the numbers seem to get "stuck" at the end of the range (3x1000 numbers in case of both runs):

  • 49: 33%, 33%
  • 48: 22%, 22%
  • 47: 14%, 14%
  • 46: 10%, 10%
  • 45: 6%, 6%

Interestingly, the frequencies are pretty much the same as in the case of Program 1.

Remark: I repeated the runs for lists up to 10 times, and experienced the very same distribution with +/- 1% differences, just didn't want to list all the numbers here in this strange "table" format.

Program 3

Just to spice up things a bit, I ran a third little snippet, creating the Set (Program 1) from a List (Program 2):

val set: Set[Int] = List.range(1, 50).toSet
val g = Gen.pick(3, set).map { _.toList }
forAll (g) { s => println(s) }

Now the numbers are the same as for Program 2 (List wins!), although the frequencies (again, for 3*1000 numbers in 2 runs) got slightly different at the end:

  • 49: 33%, 33%
  • 48: 23%, 22%
  • 47: 16%, 15%
  • 46: 9%, 10%
  • 45: 7%, 6%

Question

Even though the sample size is not enough (as it is never enough) to tell true randomness, I can't help but question Gen.pick's claimed randomness (as far as using it out-of-the-box, I might need to set some seed for it to work "more" randomly), since numbers got "stuck", and frequencies are almost the same.

Upon looking at Gen.pick's source code, at line #672 a certain seed0 is used:

def pick[T](n: Int, l: Iterable[T]): Gen[Seq[T]] = {
    if (n > l.size || n < 0) throw new IllegalArgumentException(s"invalid choice: $n")
    else if (n == 0) Gen.const(Nil)
    else gen { (p, seed0) =>
    // ...

which I can't find defined anywhere else (in Gen.scala source code, or in scala.util.Random documentation), but I have a hunch it might have something to do with the observed behaviour. Is this expected behaviour of Gen.pick? If so, how can I get "more" random picking?

like image 351
bugfoot Avatar asked May 08 '17 04:05

bugfoot


Video Answer


2 Answers

Although @ashawley answer is already accepted I don't think it is correct. I think that this is actually a bug and it was introduced by erik-stripe's commit on Sep 1, 2016 and the bug is actually in the line

      val i = (x & 0x7fffffff).toInt % n

and it was supposed to be

      val i = (x & 0x7fffffff).toInt % count

which is still not quite correct.

I also expect that your 33% for the last value is actually 100% and you didn't take into account the fact that you select 3 elements so all your statistics should be multiplied by 3. So for 3-element selection the last element is selected 100% of the time, the previous one - 66.6% and so on, which is even worse than you expected.

Here is again an excerpt from the code:

else gen { (p, seed0) =>
  val buf = ArrayBuffer.empty[T]
  val it = l.iterator
  var seed = seed0
  var count = 0
  while (it.hasNext) {
    val t = it.next
    count += 1
    if (count <= n) {
      buf += t
    } else {
      val (x, s) = seed.long
      val i = (x & 0x7fffffff).toInt % n
      if (i < n) buf(i) = t
      seed = s
    }
  }
  r(Some(buf), seed)
}

So what this code is supposed to do and what it actually does? The if (count <= n) branch fills the output buf with first n elements and after that always the else branch works. To make it more clear I changed the while moving if outside to the following code:

  for (i <- 0 until  n) {
    val t = it.next
    buf += t
  }
  while (it.hasNext) {
    val t = it.next
    val (x, s) = seed.long
    val i = (x & 0x7fffffff).toInt % n
    if (i < n) buf(i) = t
    seed = s
  }

So now it becomes obvious that the else branch should at the same time decide if the current element should be added to the output buf AND which element there it should replace. Obviously, current code always selects each element because if (i < n) is always true given that i is calculated as something % n. And this is why you see such a huge skew to the last elements.

Obviously the plan was to use a modified version of Fisher–Yates shuffle that selects only first n elements of the shuffle and to do it properly you need to select random numbers in the range [0, count) and this is probably why the code is written the way it is written i.e. preserving counter inside while loop.

Using % count is still not quite correct because such a simple way doesn't produce uniform distribution when count is not a power of 2. To be more fair something like

    val c0 = choose(0, count-1)
    val rt: R[Int] = c0.doApply(p, seed)        
    seed = rt.seed      
    val i = rt.retrieve.get // index to swap current element with. Should be fair random number in range [0, count-1], see Fisher–Yates shuffle
    if (i < n) buf(i) = t

or some other way to create i as a fair uniformly distributed random number in such a range should be used.

Update (Why just % count is wrong)

You can look at java.util.Random.nextInt(int) implementation or org.scalacheck.Choose.chLng for an example of how it should be done. It is more complicated than just % count and there is a good reason for it. To illustrate it consider following example. Let's assume that your source random generator generates uniformly random 3-bit values i.e in range just [0, 7] and you want to get ranadom number in range [0, 2] and you do it by simply doing

srcGenerator.nextInt() % 3

Now consider mapping of values in range [0, 7] to your range [0, 2]:

  • 0, 3, 6 will be mapped to 0 (i.e. 3 values are mapped)
  • 1, 4, 7 will be mapped to 1 (i.e. 3 values are mapped)
  • 2, 5 will be mapped to 2 (i.e. only 2 values are mapped)

So if you do just % 3 your distribution will be 0 - 3/8, 1 - 3/8, 2 - 2/8 which is obviously not uniform. This is why those implementations I referenced earlier use some kind of a loop and discard some values generated by the source generator. It is required to produce unifrom distribution.

like image 157
SergGr Avatar answered Oct 12 '22 13:10

SergGr


I don't think it has anything to do with the seed. It has everything to do with Scalacheck's heurisitic.

There is a subtle bug. Consider what it's doing. It is forcing itself to choose values at the beginning, and than randomly overwriting them after that:

while (it.hasNext) {
  val t = it.next
  count += 1
  if (count <= n) {
    buf += t
  } else {
    val (x, s) = seed.long
    val i = (x & 0x7fffffff).toInt % n
    if (i < n) buf(i) = t
    seed = s
  }
  ...

It is randomly assigning those elements to the result in the else-block, that's why it is giving preference to the tail values.

So, pick is randomly selecting values from a set. However, it has sacrificed picking equally among the values and favors the end of the list because the code is trying to iterate through the list lazily.

To attempt to get an even distribution of picked elements, you would need to know the length of the collection, but as my answer suggests, that's not possible without consuming the iterable twice.

Maybe if you ran reverse on your list, or shuffle, you'd get a better distribution of picks with pick.

Because Scalacheck is a generic property testing library, I predict it can't do either of these things without sacrificing performance for arbitrary-sized collections.

Update

But as Alexey Romanov points out, this should implement the reservoir sampling algorithm which avoids knowing the length and can be run in O(n) time. There is simply a defect in the code. The fix is simply correcting the line for random number generation. It should get a random number from 1 to the kth (count) element visited in the list.

val i = (x & 0x7fffffff).toInt % n

Should be:

val i = (x & 0x7fffffff).toInt % count

I've submitted a PR to Scalacheck:

https://github.com/rickynils/scalacheck/pull/333

like image 32
ashawley Avatar answered Oct 12 '22 12:10

ashawley