I want to randomly sample from a Scala list or array (not an RDD), the sample size can be much longer than the length of the list or array, how can I do this efficiently? Because the sample size can be very big and the sampling (on different lists/arrays) needs to be done a large number of times.
I know for a Spark RDD we can use takeSample() to do it, is there an equivalent for Scala list/array?
Thank you very much.
Seed is random number generator seed. This is important because you might want to be able to hard code the same seed for your tests so that you always get the same results in test, but in prod code replace it with current time in milliseconds or a random number from a good entropy source.
An easy-to-understand version would look like this:
import scala.util.Random
Random.shuffle(list).take(n)
Random.shuffle(array.toList).take(n)
// Seeded version
val r = new Random(seed)
r.shuffle(...)
For arrays:
import scala.util.Random
import scala.reflect.ClassTag
def takeSample[T:ClassTag](a:Array[T],n:Int,seed:Long) = {
val rnd = new Random(seed)
Array.fill(n)(a(rnd.nextInt(a.size)))
}
Make a random number generator (rnd
) based on your seed. Then, fill an array with random numbers from 0 until the size of your array.
The last step is applying each random value to the indexing operator of your input array. Using it in the REPL could look as follows:
scala> val myArray = Array(1,3,5,7,8,9,10)
myArray: Array[Int] = Array(1, 3, 5, 7, 8, 9, 10)
scala> takeSample(myArray,20,System.currentTimeMillis)
res0: scala.collection.mutable.ArraySeq[Int] = ArraySeq(7, 8, 7, 3, 8, 3, 9, 1, 7, 10, 7, 10,
1, 1, 3, 1, 7, 1, 3, 7)
For lists, I would simply convert the list to Array and use the same function. I doubt you can get much more efficient for lists anyway.
It is important to note, that the same function using lists would take O(n^2) time, whereas converting the list to arrays first will take O(n) time
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With