Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to randomly sample from a Scala list or array?

I want to randomly sample from a Scala list or array (not an RDD), the sample size can be much longer than the length of the list or array, how can I do this efficiently? Because the sample size can be very big and the sampling (on different lists/arrays) needs to be done a large number of times.

I know for a Spark RDD we can use takeSample() to do it, is there an equivalent for Scala list/array?

Thank you very much.

like image 695
Carter Avatar asked Oct 04 '15 09:10

Carter


People also ask

What is seed in sample Spark?

Seed is random number generator seed. This is important because you might want to be able to hard code the same seed for your tests so that you always get the same results in test, but in prod code replace it with current time in milliseconds or a random number from a good entropy source.


2 Answers

An easy-to-understand version would look like this:

import scala.util.Random

Random.shuffle(list).take(n)
Random.shuffle(array.toList).take(n)

// Seeded version
val r = new Random(seed)
r.shuffle(...)
like image 131
Marius Soutier Avatar answered Oct 22 '22 08:10

Marius Soutier


For arrays:

import scala.util.Random
import scala.reflect.ClassTag

def takeSample[T:ClassTag](a:Array[T],n:Int,seed:Long) = {
  val rnd = new Random(seed)
  Array.fill(n)(a(rnd.nextInt(a.size)))
}

Make a random number generator (rnd) based on your seed. Then, fill an array with random numbers from 0 until the size of your array.

The last step is applying each random value to the indexing operator of your input array. Using it in the REPL could look as follows:

scala> val myArray = Array(1,3,5,7,8,9,10)
myArray: Array[Int] = Array(1, 3, 5, 7, 8, 9, 10)

scala> takeSample(myArray,20,System.currentTimeMillis)
res0: scala.collection.mutable.ArraySeq[Int] = ArraySeq(7, 8, 7, 3, 8, 3, 9, 1, 7, 10, 7, 10,
1, 1, 3, 1, 7, 1, 3, 7)

For lists, I would simply convert the list to Array and use the same function. I doubt you can get much more efficient for lists anyway.

It is important to note, that the same function using lists would take O(n^2) time, whereas converting the list to arrays first will take O(n) time

like image 30
Felix Avatar answered Oct 22 '22 10:10

Felix