
How to use Spark to generate a huge amount of random integers?

I need lots of random numbers, one per line. The result should be something like this:

24324 24324
4234234 4234234
1310313 1310313
...

So I wrote this Spark code (sorry, I'm new to Spark and Scala):

import util.Random

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object RandomIntegerWriter {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: RandomIntegerWriter <num Integers> <outDir>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("Spark RandomIntegerWriter")
    val spark = new SparkContext(conf)
    val distData = spark.parallelize(Seq.fill(args(0).toInt)(Random.nextInt))
    distData.saveAsTextFile(args(1))
    spark.stop()
  }
}

Note: for now I just want to generate one number per line.

But it seems that when the number of integers gets larger, the program reports an error. Any ideas what is wrong with this piece of code?

Thank you.

asked Mar 16 '15 by Haoliang


3 Answers

In Spark 1.4 you can use the DataFrame API to do this:

In [1]: from pyspark.sql.functions import rand, randn
In [2]: # Create a DataFrame with one int column and 10 rows.
In [3]: df = sqlContext.range(0, 10)
In [4]: df.show()
+--+
|id|
+--+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+--+

In [4]: # Generate two other columns using uniform distribution and normal distribution.
In [5]: df.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal")).show()
+--+-------------------+--------------------+
|id|            uniform|              normal|
+--+-------------------+--------------------+
| 0| 0.7224977951905031| -0.1875348803463305|
| 1| 0.2953174992603351|-0.26525647952450265|
| 2| 0.4536856090041318| -0.7195024130068081|
| 3| 0.9970412477032209|  0.5181478766595276|
| 4|0.19657711634539565|  0.7316273979766378|
| 5|0.48533720635534006| 0.07724879367590629|
| 6| 0.7369825278894753| -0.5462256961278941|
| 7| 0.5241113627472694| -0.2542275002421211|
| 8| 0.2977697066654349| -0.5752237580095868|
| 9| 0.5060159582230856|  1.0900096472044518|
+--+-------------------+--------------------+
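For the original goal of writing plain integers one per line, the same DataFrame API can feed a text output. A minimal Scala sketch along these lines, where the row count, seed, and output path are illustrative assumptions:

import org.apache.spark.sql.functions.rand

// Scale the uniform [0, 1) column up to non-negative Ints, one per row.
val n = 10000000L                                            // illustrative row count
val ints = sqlContext.range(0, n)
  .select((rand(10) * Int.MaxValue).cast("int").alias("value"))

// One integer per line, written in parallel by the executors.
ints.rdd.map(_.getInt(0)).saveAsTextFile("random-integers")  // illustrative path

Because the rows are generated on the executors, nothing large is ever materialized on the driver.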
answered Nov 17 '22 by vmhacker

Try:

val distData = spark.parallelize(Seq[Int](), numPartitions)
  .mapPartitions { _ => {
    (1 to recordsPerPartition).map{_ => Random.nextInt}.iterator
  }}

It will create an empty collection on the driver side, but generate the random integers on the worker side. The total number of records is numPartitions * recordsPerPartition.
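A self-contained sketch of this approach, assuming illustrative values for numPartitions and recordsPerPartition (neither is given in the answer) and an illustrative output directory:

import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object RandomIntegerWriter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark RandomIntegerWriter")
    val sc = new SparkContext(conf)

    val numPartitions = 100            // number of tasks / output files (illustrative)
    val recordsPerPartition = 1000000  // integers produced by each task (illustrative)

    val distData = sc.parallelize(Seq[Int](), numPartitions)
      .mapPartitions { _ =>
        // Each partition generates its own numbers; nothing is held on the driver.
        (1 to recordsPerPartition).iterator.map(_ => Random.nextInt())
      }

    distData.saveAsTextFile("random-integers")  // illustrative output directory
    sc.stop()
  }
}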

answered Nov 17 '22 by cloud

Running on a Spark Cluster

The current version materializes the entire collection of random numbers in the memory of the driver. If that collection is very large, the driver will run out of memory. Note that this version does not make use of Spark's processing capabilities either, as it only uses Spark to save the data after it has been created.

Assuming we are working on a cluster, what we need to do is distribute the work required to generate the data among the executors. One way of doing that would be transforming the original algorithm into a version that can work across the cluster by dividing the work among executors:

val numRecords:Int = ???
val partitions:Int = ???
val recordsPerPartition = numRecords / partitions // we are assuming here that numRecords is divisible by partitions, otherwise we need to compensate for the residual 

val seedRdd = sparkContext.parallelize(Seq.fill(partitions)(recordsPerPartition),partitions)
val randomNrs = seedRdd.flatMap(records => Seq.fill(records)(Random.nextInt))
randomNrs.saveAsTextFile(...)
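To compensate for the residual mentioned in the comment above, one possible sketch (giving the remainder to the first partition is an arbitrary choice):

val base     = numRecords / partitions
val residual = numRecords % partitions
val counts   = Seq.fill(partitions)(base).updated(0, base + residual)  // remainder goes to one partition

val seedRdd   = sparkContext.parallelize(counts, partitions)
val randomNrs = seedRdd.flatMap(records => Seq.fill(records)(Random.nextInt))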

Running on a single machine

If we don't have a cluster, and this is meant to run on a single machine, the question would be "why use Spark?". This random generator process is basically I/O bound and could be done with O(1) memory by sequentially writing random numbers to a file:

import java.io._
import scala.util.Random

def randomFileWriter(file: String, records: Long): Unit = {
    val pw = new PrintWriter(new BufferedWriter(new FileWriter(file)))
    @annotation.tailrec
    def loop(count: Long): Unit = {
        if (count <= 0) () else {
          pw.println(Random.nextInt())
          loop(count - 1)  // recurse; the original snippet called an undefined writeRandom(writer, ...)
        }
    }
    loop(records)
    pw.close()
}
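Example call, with an illustrative file name and record count:

randomFileWriter("random-integers.txt", 100000000L)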
answered Nov 17 '22 by maasg