For a study project I have written a Scala application that uses a bunch of futures to do a parallel computation. I noticed that on my local machine (4 cores) the code runs faster than on the many-core server of our computer science institute (64 cores). Now I want to know why this is.
The task was to generate random boolean k-CNF formulas with n different variables randomly distributed over m clauses, and then determine at which m/n ratio the probability that a formula is solvable drops below 50% for different random distributions. For this I have implemented a probabilistic k-SAT algorithm, a clause generator and some other code. The core is a function that takes n and m as well as the generator function, runs 100 futures and waits for the results. The function looks like this:
def avgNonvalidClauses(n: Int, m: Int)(implicit clauseGenerator: ClauseGenerator) = {
  val startTime = System.nanoTime

  /** how many iterations to build the average **/
  val TRIES = 100

  // do TRIES iterations in parallel
  val tasks = for (i <- 0 until TRIES) yield future[Option[Config]] {
    val clause = clauseGenerator(m, n)
    val solution = CNFSolver.probKSat(clause)
    solution
  }

  /* wait for all futures to finish and collect the results. we will only wait
   * at most TRIES * 100 ms (note: flatten filters out all None's) */
  val results = awaitAll(100 * TRIES, tasks: _*).asInstanceOf[List[Option[Option[Config]]]].flatten

  val millis = Duration(System.nanoTime - startTime, NANOSECONDS).toMillis
  val avg = (results count (_.isDefined)) / results.length.toFloat
  println(s"n=$n, m=$m => $avg ($millis ms)")
  avg
}
On my local machine I get these results:
[info] Running Main
n=20, m=120 => 0.0 (8885 ms)
n=21, m=121 => 0.0 (9115 ms)
n=22, m=122 => 0.0 (8724 ms)
n=23, m=123 => 0.0 (8433 ms)
n=24, m=124 => 0.0 (8544 ms)
n=25, m=125 => 0.0 (8858 ms)
[success] Total time: 53 s, completed Jan 9, 2013 8:21:30 PM
On the 64-core server I get:
[info] Running Main
n=20, m=120 => 0.0 (43200 ms)
n=21, m=121 => 0.0 (38826 ms)
n=22, m=122 => 0.0 (38728 ms)
n=23, m=123 => 0.0 (32737 ms)
n=24, m=124 => 0.0 (41196 ms)
n=25, m=125 => 0.0 (42323 ms)
[success] Total time: 245 s, completed 09.01.2013 20:28:22
However, both machines are under full load (the server averages a load of around 60 to 65), so enough threads are running. Why is this? Am I doing something completely wrong?
My local machine has an "AMD Phenom(tm) II X4 955 Processor" CPU; the server uses an "AMD Opteron(TM) Processor 6272". The local CPU has 6800 bogomips, the server's 4200. So, while the local CPU is about a third faster per core, the server has 16 times as many cores.
I have pushed a trimmed-down example of my code to GitHub so you can try it yourself if you are interested: https://github.com/Blattlaus/algodemo (it's an sbt project using Scala 2.10).
You have already figured out the answer: the problem is Random.nextInt(), which uses an AtomicLong. If it is accessed frequently from different threads you get cache thrashing, which is worse on your 64-core machine because the caches are further apart (electrically) and it therefore takes longer to acquire the necessary cache-line locks.
See this Stack Overflow answer for more details and for the solution, which is basically to use a thread-local random number generator: Contention in concurrent use of java.util.Random
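A minimal sketch of what that change could look like in Scala; randomVariable is a hypothetical helper for illustration, not the clause generator from the question:

import java.util.concurrent.ThreadLocalRandom

// Each thread draws from its own generator, so no shared AtomicLong is contended.
def randomVariable(n: Int): Int =
  ThreadLocalRandom.current().nextInt(n) + 1  // e.g. pick a variable index in 1..n

// Pre-Java-7 alternative: keep one scala.util.Random instance per thread.
val localRandom = new ThreadLocal[scala.util.Random] {
  override def initialValue() = new scala.util.Random()
}
// usage: localRandom.get().nextInt(n)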
Operations on denormalized floating-point numbers can take an order of magnitude longer on the x86 architecture. See:
Why does changing 0.1f to 0 slow down performance by 10x?
I haven't examined your code, but given that you return NaN, that might be the case. Try removing randomness from your test to verify that hypothesis.
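If you want to check for denormals directly, one possible approach (an illustrative sketch, not code from the project) is to compare intermediate values against java.lang.Double.MIN_NORMAL and flush them to zero:

// A Double is denormal (subnormal) when it is non-zero but its magnitude
// is below the smallest normal value.
def isDenormal(x: Double): Boolean =
  x != 0.0 && math.abs(x) < java.lang.Double.MIN_NORMAL

// Common mitigation: flush tiny intermediate results to zero.
def flushToZero(x: Double): Double =
  if (isDenormal(x)) 0.0 else x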