Scala Futures are slow with many cores

For a study project I have written a Scala application that uses a bunch of futures to do a parallel computation. I noticed that on my local machine (4 cores) the code runs faster than on the many-core server of our computer science institute (64 cores). Now I want to know why this is.

Task in Detail

The task was to create random boolean k-CNF formulas with n different variables randomly distributed over m clauses, and then see at which m/n ratio the probability that a formula is satisfiable drops below 50% for different random distributions. For this I have implemented a probabilistic k-SAT algorithm, a clause generator and some other code. The core is a function that takes n and m as well as the generator function, runs 100 futures and waits for the result. The function looks like this:

Code in question

def avgNonvalidClauses(n: Int, m: Int)(implicit clauseGenerator: ClauseGenerator) = {

    val startTime = System.nanoTime

    /** how many iterations to average over **/
    val TRIES = 100

    // do TRIES iterations in parallel 
    val tasks = for (i <- 0 until TRIES) yield future[Option[Config]] {
        val clause = clauseGenerator(m, n)
        val solution = CNFSolver.probKSat(clause)
        solution
    }

    /* wait for all threads to finish and collect the results. we will only wait
     * at most TRIES * 100ms (note: flatten filters out all
     * None's) */
    val results = awaitAll(100 * TRIES, tasks: _*).asInstanceOf[List[Option[Option[Config]]]].flatten

    val millis = Duration(System.nanoTime - startTime, NANOSECONDS).toMillis
    val avg = (results count (_.isDefined)) /  results.length.toFloat

    println(s"n=$n, m=$m => $avg ($millis ms)")

    avg
  }

Problem

On my local machine I get these results

[info] Running Main 
n=20, m=120 => 0.0 (8885 ms)
n=21, m=121 => 0.0 (9115 ms)
n=22, m=122 => 0.0 (8724 ms)
n=23, m=123 => 0.0 (8433 ms)
n=24, m=124 => 0.0 (8544 ms)
n=25, m=125 => 0.0 (8858 ms)
[success] Total time: 53 s, completed Jan 9, 2013 8:21:30 PM

On the 64-core server I get:

[info] Running Main 
n=20, m=120 => 0.0 (43200 ms)
n=21, m=121 => 0.0 (38826 ms)
n=22, m=122 => 0.0 (38728 ms)
n=23, m=123 => 0.0 (32737 ms)
n=24, m=124 => 0.0 (41196 ms)
n=25, m=125 => 0.0 (42323 ms)
[success] Total time: 245 s, completed 09.01.2013 20:28:22

However, I see full load on both machines (the server averages a load of around 60 to 65), so enough threads are running. Why is this? Am I doing something completely wrong?

My local machine has an "AMD Phenom(tm) II X4 955 Processor" CPU; the server uses "AMD Opteron(TM) Processor 6272" CPUs. The local CPU reports 6800 bogomips, the server's 4200. So while each local core is roughly 1.6 times faster, the server has 16 times as many cores.

Additional

I have pushed a trimmed-down example of my code to GitHub so you can try it yourself if you are interested: https://github.com/Blattlaus/algodemo (it's an sbt project using Scala 2.10).

Updates

  1. I've eliminated any randomness by seeding the random number generators with 42. This changes nothing.
  2. I've changed the test set. Now the results are even more astonishing (the server is 5 times slower!). Note: all outputs for the average percentage of unsolvable clauses are zero because of the input. This is normal and expected.
  3. Added info about the CPUs.
  4. I've noticed that calls to Random.nextInt() are a factor of 10 slower on the server. I have wrapped all calls in a helper that measures the runtime and prints it to the console if a call takes longer than 10 ms (a sketch of such a wrapper is shown after this list). On my local machine I get a few such warnings, typically around 10-20 ms. On the server I get many more, and they tend to be above 100 ms. Could this be the issue?
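
For reference, the measurement helper mentioned in update 4 is not shown above; a minimal sketch of what such a wrapper could look like (the names are hypothetical, not the project's actual helper):

import scala.util.Random

object TimedRandom {
  private val rnd = new Random(42)   // seeded as described in update 1

  // Hypothetical wrapper: time a single nextInt call and print a warning
  // if it took longer than 10 ms.
  def nextInt(bound: Int): Int = {
    val start = System.nanoTime
    val result = rnd.nextInt(bound)
    val millis = (System.nanoTime - start) / 1000000
    if (millis > 10) println(s"Random.nextInt($bound) took $millis ms")
    result
  }
}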
asked Jan 09 '13 by Martin Thurau

2 Answers

You have already figured out the answer: the problem is Random.nextInt(), which uses an AtomicLong internally. If it is accessed frequently from different threads you will get cache thrashing, which is worse on your 64-core machine because the caches are further apart (electrically) and hence it takes longer to obtain the necessary cache-line locks.

See this Stack Overflow answer for more details and for the solution (which is basically to use a thread-local random number generator): Contention in concurrent use of java.util.Random
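
A minimal sketch of that approach, assuming Java 7's ThreadLocalRandom is available (the helper name and literal encoding are hypothetical, not taken from the project):

import java.util.concurrent.ThreadLocalRandom

// Each worker thread gets its own generator, so the 100 futures no longer
// contend on the single AtomicLong seed inside a shared Random instance.
def randomLiteral(numVars: Int): Int = {
  val rnd = ThreadLocalRandom.current()
  val variable = rnd.nextInt(numVars) + 1          // variable index 1..numVars
  if (rnd.nextBoolean()) variable else -variable   // random sign for the literal
}

Note that ThreadLocalRandom cannot be seeded, so if the fixed seed of 42 from update 1 matters for reproducibility, a ThreadLocal[scala.util.Random] initialized with a per-thread seed is an alternative.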

answered by Rob Wilton


Operations on denormalized floating-point numbers can take an order of magnitude longer on x86 architectures. See:

Why does changing 0.1f to 0 slow down performance by 10x?

I haven't examined your code, but given that you return NaN, that might be the case. Try removing randomness from your test to verify that hypothesis.
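
To check that hypothesis directly, a micro-benchmark along these lines (a sketch, not taken from the project) runs the same loop on normal and on denormal floats:

object DenormalDemo {
  def time(label: String)(block: => Float): Unit = {
    val start = System.nanoTime
    val result = block
    println(s"$label: ${(System.nanoTime - start) / 1000000} ms (result=$result)")
  }

  // Repeatedly halve x; once it drops below `floor`, reset it to `seed` so all
  // values stay in the range under test (entirely normal vs. entirely denormal).
  def halvingLoop(seed: Float, floor: Float): Float = {
    var x = seed
    var i = 0
    while (i < 100000000) {
      x *= 0.5f
      if (x < floor) x = seed
      i += 1
    }
    x
  }

  def main(args: Array[String]): Unit = {
    time("normal values")(halvingLoop(1.0f, 1e-30f))     // never leaves the normal range
    time("denormal values")(halvingLoop(1e-39f, 1e-44f)) // stays below Float.MinNormal (~1.18e-38)
  }
}

If the denormal run is dramatically slower on the server, denormals are a plausible factor; if both runs take about the same time, the hypothesis can be ruled out.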

answered by Jakozaur