Why aren't my scala futures more efficient?

Tags:

I'm running this scala code on a 32-bit quad-core Core2 system:

def job(i:Int,s:Int):Long = {
  val r=(i to 500000000 by s).map(_.toLong).foldLeft(0L)(_+_)
  println("Job "+i+" done")
  r
}

import scala.actors.Future
import scala.actors.Futures._

val JOBS=4

val jobs=(0 until JOBS).toList.map(i=>future {job(i,JOBS)})
println("Running...")
val results=jobs.map(f=>f())
println(results.foldLeft(0L)(_+_))

(Yes, I do know there are much more efficient ways to sum a series of integers; it's just to give the CPU something to do).

Depending on what I set JOBS to, the code runs in the following times:

JOBS=1 : 31.99user 0.84system 0:28.87elapsed 113%CPU
JOBS=2 : 27.71user 1.12system 0:14.74elapsed 195%CPU
JOBS=3 : 33.19user 0.39system 0:13.02elapsed 257%CPU
JOBS=4 : 49.08user 8.46system 0:22.71elapsed 253%CPU

I'm surprised that this doesn't really scale well beyond 2 futures "in play". I do a lot of multithreaded C++ code and have no doubt I'd get good scaling up to 4 cores and see >390% CPU utilisation if I coded this sort of thing with Intel's TBB or boost::threads (it'd be considerably more verbose of course).

So: what's going on and how can I get the scaling to 4 cores I'd expect to see ? Is this limited by something in scala or the JVM ? It occurs to me I don't actually know "where" scala's futures run... is a thread spawned per future, or does "Futures" provide a thread pool dedicated to running them ?

[I'm using the scala 2.7.7 packages from Debian/Squeeze on a Lenny system with sun-java6 (6-20-0lennny1).]

Update:

As suggested in Rex's answer, I recoded to avoid object creation.

def job(i:Long,s:Long):Long = {
  var t=0L
  var v=i
  while (v<=10000000000L) {
    t+=v
    v+=s
  }
  println("Job "+i+" done")
  t
}
// Rest as above...

This was so much faster I had to significantly increase the iteration count to run for any amount of time! Results are:

JOBS=1: 28.39user 0.06system 0:29.25elapsed 97%CPU
JOBS=2: 28.46user 0.04system 0:14.95elapsed 190%CPU
JOBS=3: 24.66user 0.06system 0:10.26elapsed 240%CPU
JOBS=4: 28.32user 0.12system 0:07.85elapsed 362%CPU

which is much more like what I'd hope to see (although the 3 jobs case is a little odd, with one task consistently completing a couple of seconds before the other two).

Pushing it a bit further, on a quad-core hyperthreaded i7 the latter version with JOBS=8 achieves an x4.4 speedup vs JOBS=1, with 571% CPU usage.

969

asked Sep 02 '10 13:09

timday

2 Answers

My guess is that the garbage collector is doing more work than the addition itself. So you're limited by what the garbage collector can manage. Try running the test again with something that doesn't create any objects (e.g. use a while loop instead of the range/map/fold). You can also play with the parallel GC options if your real application will hit the GC this heavily.

135

answered Nov 14 '22 09:11

Rex Kerr

Try

(i to 500000000 by s).view.map(_.toLong).foldLeft(0L)(_+_)

The application of view is supposed to (as I understood id) to avoid repeated iteration and object creation by providing simple wrappers.

Note also that you can use reduceLeft(_+_) instead of fold.

answered Nov 14 '22 09:11

Raphael

Related questions
                            
                                How does Thread.currentThread() work?
                            
                                Getting exclusive system-wide lock in Java
                            
                                How to apply InterLocked.Exchange for Enum Types in C#?
                            
                                What are the advantages of concurrent.futures over multiprocessing in Python?
                            
                                Advantages of Executors over new Thread
                            
                                Java: Is volatile / final required for reference to synchronized object?
                            
                                Writing a thread safe modular counter in Java
                            
                                Log4j: is it synchronized for multithreaded calls?
                            
                                How Do I Detect When a Client Thread Exits?
                            
                                Python ThreadPoolExecutor - is the callback guaranteed to run in the same thread as submitted func?
                            
                                Are valgrind "uninitialized value" warnings false positives in ATLAS multithreaded BLAS routines?
                            
                                Destruction of condition variable randomly loses notification
                            
                                Do spurious wakeups affect Thread.sleep?
                            
                                Using threading.Lock as context manager
                            
                                Python requests - threads/processes vs. IO
                            
                                Does the Java Memory Model (JSR-133) imply that entering a monitor flushes the CPU data cache(s)?
                            
                                the behavior of std::async with std::launch::async policy
                            
                                Timer already cancelled
                            
                                Unix pthreads and signals: per thread signal handlers
                            
                                In Java, how can I ensure safe and consistent concurrent usage of a boolean flag while minimizing time performance impact?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why aren't my scala futures more efficient?

Tags:

performance

multithreading

scalability

scala

future

timday

People also ask

2 Answers

Rex Kerr

Raphael

Recent Activity

Donate For Us