Edit: My sample size was too small. When I ran it against the real data on 8 CPUs, I saw a 7.2x speed increase. Not too shabby for adding 4 characters to my code ;)
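For reference, the 4 characters are the .par call that already appears in the code below: switching from the plain collection to its parallel view is the whole change, along the lines of

val inRange = targets.filter(within50)        // sequential: runs on one core
val inRangePar = targets.par.filter(within50) // parallel: ".par" fans the same filter out across cores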
I am currently trying to 'sell' management on the benefits of using Scala, especially when it comes to scaling with CPUs. To that end, I created a simple test application that does a bunch of vector math, and I was a bit surprised to find that the runtime was not noticeably better on my quad-core machine. Interestingly, I found that the runtime is worst the first time you go through the collection and gets better with subsequent calls. Are there some lazy things in the parallel collection that are causing this, or am I just doing this wrong? It should be noted that I come from the C++/C# world, so it's entirely possible that I have messed up my configuration somehow. Regardless, here's my setup:
IntelliJ Scala Plugin
Scala 2.9.1.final
Windows 7 64-bit, quad-core processor (no hyperthreading)
import util.Random

// simple Vector3D class with final x, y, z components, a length, and a '-' function
class Vector3D(val x: Double, val y: Double, val z: Double) {
  def length = math.sqrt(x * x + y * y + z * z)
  def -(rhs: Vector3D) = new Vector3D(x - rhs.x, y - rhs.y, z - rhs.z)
}

object MainClass {
  def main(args: Array[String]) {
    println("Available CPU's: " + Runtime.getRuntime.availableProcessors())
    println("Parallelism Degree set to: " + collection.parallel.ForkJoinTasks.defaultForkJoinPool.getParallelism)

    // my position
    val myPos = new Vector3D(0, 0, 0)
    val r = new Random(0)

    // nextRand gets us a random Double between 0 and 100
    def nextRand = r.nextDouble() * 100

    // make 10 million random targets
    val targets = (0 until 10000000).map(_ => new Vector3D(nextRand, nextRand, nextRand)).toArray

    // take the .par hit before we start profiling
    val parTargets = targets.par
    println("Created " + targets.length + " vectors")

    // range function: distance from my position to the target
    val rangeFunc: Vector3D => Double = targetPos => (targetPos - myPos).length

    // we'll select the ones that are < 50 away
    val within50: Vector3D => Boolean = targetPos => rangeFunc(targetPos) < 50

    // time it sequentially (note: filter returns the matching targets, not a count)
    val startTime_sequential = System.currentTimeMillis()
    val targetsInRange_sequential = targets.filter(within50)
    val endTime_sequential = System.currentTimeMillis()
    println("Sequential (ms): " + (endTime_sequential - startTime_sequential))

    // do the parallel version 10 times
    for (i <- 1 to 10) {
      val startTime_par = System.currentTimeMillis()
      val targetsInRange_parallel = parTargets.filter(within50)
      val endTime_par = System.currentTimeMillis()
      val ms = endTime_par - startTime_par
      println("Iteration[" + i + "] Executed in " + ms + " ms")
    }
  }
}
The output of this program is:
Available CPU's: 4
Parallelism Degree set to: 4
Created 10000000 vectors
Sequential (ms): 216
Iteration[1] Executed in 227 ms
Iteration[2] Executed in 253 ms
Iteration[3] Executed in 76 ms
Iteration[4] Executed in 78 ms
Iteration[5] Executed in 77 ms
Iteration[6] Executed in 80 ms
Iteration[7] Executed in 78 ms
Iteration[8] Executed in 78 ms
Iteration[9] Executed in 79 ms
Iteration[10] Executed in 82 ms
So what's going on here? The first two times we run the filter it's slower, and then things speed up. I understand that there will inherently be some parallelism startup cost; I'm just trying to figure out where it makes sense to express the parallelism in my application. Specifically, I want to be able to show management a program that runs 3-4 times faster on a quad-core box. Is this just not a good problem?
Ideas?
You have the micro-benchmark disease. You are most likely benchmarking the JIT compile phase. You'll need to warm up your JIT with a pre-run first.
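If you want to stay with plain System.currentTimeMillis timing, a minimal sketch of that warm-up is below; timeAfterWarmup is a made-up helper (not a library function) that runs the body a few times untimed so HotSpot compiles the hot paths, then measures:

// hypothetical helper: run `body` untimed to let the JIT compile it,
// then report the wall-clock time of each subsequent run
def timeAfterWarmup[A](warmupRuns: Int, timedRuns: Int)(body: => A): Seq[Long] = {
  for (_ <- 1 to warmupRuns) body   // results discarded: this is the JIT warm-up
  for (_ <- 1 to timedRuns) yield {
    val start = System.currentTimeMillis()
    body
    System.currentTimeMillis() - start
  }
}

// e.g. against the question's collections:
// val times = timeAfterWarmup(5, 10) { parTargets.filter(within50) }
// println("median ms: " + times.sorted.apply(times.size / 2))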
Probably the best idea is to use a micro-benchmarking framework like http://code.google.com/p/caliper/, which handles all of that for you.
Edit: There is a nice SBT template for Caliper-benchmarking Scala projects, as referenced from this blog post.
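As a rough, untested sketch of how the question's measurement might look under Caliper's 0.x API (a SimpleBenchmark subclass whose time*(reps) methods Caliper invokes after doing its own warm-up; the SBT template above wires up the runner):

import com.google.caliper.SimpleBenchmark
import util.Random

class FilterBenchmark extends SimpleBenchmark {
  // same setup as the question, built once per benchmark instance
  val r = new Random(0)
  def nextRand = r.nextDouble() * 100
  val myPos = new Vector3D(0, 0, 0)
  val targets = (0 until 10000000).map(_ => new Vector3D(nextRand, nextRand, nextRand)).toArray
  val parTargets = targets.par
  val within50 = (t: Vector3D) => (t - myPos).length < 50

  // Caliper calls each time* method with a rep count it chooses
  def timeSequentialFilter(reps: Int) {
    for (_ <- 1 to reps) targets.filter(within50)
  }

  def timeParallelFilter(reps: Int) {
    for (_ <- 1 to reps) parTargets.filter(within50)
  }
}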