 

Spark - Random Number Generation

I have written a method that must draw a random number to simulate a Bernoulli distribution. I use random.nextDouble to generate a number between 0 and 1, then make my decision based on that value and my probability parameter.
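For concreteness, a minimal sketch of the draw I am describing (p stands in for my probability parameter):

import scala.util.Random

val rng = new Random()
def bernoulli(p: Double): Boolean = rng.nextDouble() <= p  // true with probability p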

My problem is that Spark generates the same random numbers within each iteration of my for loop's mapping function. I am using the DataFrame API. My code follows this format:

val myClass = new MyClass()
val M = 3
val myAppSeed = 91234
val rand = new scala.util.Random(myAppSeed)

for (m <- 1 to M) {
  val newDF = sqlContext.createDataFrame(myDF
    .map { row => RowFactory.create(
      row.getString(0),
      myClass.myMethod(row.getString(2), rand.nextDouble()))
    }, myDF.schema)
}

Here is the class:

class MyClass extends Serializable {
  val q = qProb

  def myMethod(s: String, rand: Double) = {
    if (rand <= q) {
      // do something
    } else {
      // do something else
    }
  }
}

I need a new random number every time myMethod is called. I also tried generating the number inside my method with java.util.Random (scala.util.Random in Scala 2.10 does not extend Serializable) as below, but I'm still getting the same numbers within each for loop:

val r = new java.util.Random(s.hashCode.toLong)
val rand = r.nextDouble()
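Seeding with s.hashCode is presumably part of the problem: a seeded PRNG always produces the same sequence, so re-creating the generator from the same string yields the same "random" value on every call:

val r1 = new java.util.Random("foo".hashCode.toLong)
val r2 = new java.util.Random("foo".hashCode.toLong)
r1.nextDouble() == r2.nextDouble()  // true: same seed, same sequence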

I've done some research, and it seems this has to do with Spark's deterministic nature: my seeded Random is serialized into the map closure, so each task starts from the same generator state.

— asked by Brian, Apr 06 '16

2 Answers

Just use the SQL function rand:

import org.apache.spark.sql.functions._

//df: org.apache.spark.sql.DataFrame = [key: int]

df.select($"key", rand() as "rand").show
+---+-------------------+
|key|               rand|
+---+-------------------+
|  1| 0.8635073400704648|
|  2| 0.6870153659986652|
|  3|0.18998048357873532|
+---+-------------------+


df.select($"key", rand() as "rand").show
+---+------------------+
|key|              rand|
+---+------------------+
|  1|0.3422484248879837|
|  2|0.2301384925817671|
|  3|0.6959421970071372|
+---+------------------+
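To express the Bernoulli decision from the question directly on the DataFrame, the draw can stay in a column expression. This is a sketch where q stands in for the probability parameter (rand(seed: Long) is also available when reproducibility matters):

import org.apache.spark.sql.functions._

val q = 0.3  // stand-in for the probability parameter
df.select($"key", (rand() <= q) as "bernoulli").show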
— answered by David Griffin, Sep 21 '22

According to this post, the best solution is to create the new scala.util.Random neither inside the map nor completely outside of it (i.e., in the driver code), but in an intermediate mapPartitionsWithIndex:

import scala.util.Random

val myAppSeed = 91234
val newRDD = myRDD.mapPartitionsWithIndex { (indx, iter) =>
  // one RNG per partition, seeded from the partition index so each
  // partition draws a different, reproducible sequence
  val rand = new Random(indx + myAppSeed)
  iter.map(x => (x, Array.fill(10)(rand.nextDouble)))
}
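Adapted to the DataFrame from the question, the same idea might look like the sketch below (it reuses the question's myDF, myClass, and myAppSeed):

val newRDD = myDF.rdd.mapPartitionsWithIndex { (indx, iter) =>
  val rand = new scala.util.Random(indx + myAppSeed)
  iter.map(row => RowFactory.create(
    row.getString(0),
    myClass.myMethod(row.getString(2), rand.nextDouble())))
}
val newDF = sqlContext.createDataFrame(newRDD, myDF.schema)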
— answered by leo9r, Sep 17 '22