Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What limits scaling in this simple OpenMP program?

I'm trying to understand limits to parallelization on a 48-core system (4xAMD Opteron 6348, 2.8 Ghz, 12 cores per CPU). I wrote this tiny OpenMP code to test the speedup in what I thought would be the best possible situation (the task is embarrassingly parallel):

// Compile with: gcc scaling.c -std=c99 -fopenmp -O3                                                                                               

#include <stdio.h>
#include <stdint.h>

int main(){

  const uint64_t umin=1;
  const uint64_t umax=10000000000LL;
  double sum=0.;
#pragma omp parallel for reduction(+:sum)
  for(uint64_t u=umin; u<umax; u++)
    sum+=1./u/u;
  printf("%e\n", sum);

}

I was surprised to find that the scaling is highly nonlinear. It takes about 2.9s for the code to run with 48 threads, 3.1s with 36 threads, 3.7s with 24 threads, 4.9s with 12 threads, and 57s for the code to run with 1 thread.

Unfortunately I have to say that there is one process running on the computer using 100% of one core, so that might be affecting it. It's not my process, so I can't end it to test the difference, but somehow I doubt that's making the difference between a 19~20x speedup and the ideal 48x speedup.

To make sure it wasn't an OpenMP issue, I ran two copies of the program at the same time with 24 threads each (one with umin=1, umax=5000000000, and the other with umin=5000000000, umax=10000000000). In that case both copies of the program finish after 2.9s, so it's exactly the same as running 48 threads with a single instance of the program.

What's preventing linear scaling with this simple program?

like image 986
Douglas B. Staple Avatar asked Nov 05 '13 01:11

Douglas B. Staple


1 Answers

I'm not sure this qualifies as an answer but it feels like more than a comment, so here we go.

I've never noticed particularly linear performance against the number of threads in any of my projects. For one thing, there's the scheduler, which is anything but rigorously fair, seems to me. OpenMP probably divides the task evenly among its team of threads at the outset, then joins each. On every Linux box I've had the pleasure of, I would expect a few threads to finish early, and a few threads to lag. Other platforms will vary. However that works out, of course you're waiting for the slowest to catch up. So stochastically speaking, there's a pulse of threading processing going by in something of a bell curve, the more threads the wider I should think, and you're never done until the trailing edge crosses the finish line.

What does top say? Does it tell you your process gets 2000% CPU at 20 threads, 4000% at 40? I bet it tapers off. htop by the way typically shows a process total, and separate lines for each thread. That might be interesting to watch.

With a tiny loop like that, you're probably not running into cache thrash or any such annoyance. But another issue that's bound to shave some performance off the top: like any modern multi-core CPU the Opteron runs at a higher clock rate when it's cool. The more cores you heat up, the less turbo mode you'll see.

like image 87
Salt Avatar answered Sep 20 '22 00:09

Salt