Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I measure how my multithreaded code scales (speedup)?

What would be the best way to measure the speedup of my program assuming I only have 4 cores? Obviously I could measure it up to 4, however it would be nice to know for 8, 16, and so on.

Ideally I'd like to know the amount of speedup per number of thread, similar to this graph:

Amdahl's law diagram

Is there any way I can do this? Perhaps a method of simulating multiple cores?

like image 440
Pubby Avatar asked Mar 09 '12 22:03

Pubby


People also ask

How do I make multithreading faster?

The first rule of thumb for fast multithreading: to achieve maximum speed, the threads should communicate or share common resources as little as possible.

How can multithreading can speed up certain computing tasks?

Multithreading allows the app to still work properly while executing the task. This is the way to create concurrent computing in your device. It works as if there are multiple processors within the single processor core.

Does threading speed up a process?

Multithreading speeds up an application when you have more than one processor, or a processor with hyperthreading capability. Otherwise it can't. Are you asking if it's worth having multiple threads even if they can't run concurrently?

How does multithreading work?

With multithreading, while the computer system's processor executes one instruction at a time, different threads from multiple programs are executed so fast it appears the programs are executed simultaneously.


3 Answers

I'm sorry, but in my opinion, the only reliable measurement is to actually get an 8, 16 or more cores machine and test on that.

Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.

You could try to predict what will happen, but there are a lot of factors that need to be taken into account:

  1. caches - size, number of layers, shared / non-shared
  2. memory bandwidth
  3. number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
  4. interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
  5. memory access - again, a lower number of cores work well with the SMP (symmetrical multiprocessing) model, while a higher number of core need a NUMA (non-uniform memory access) model.
like image 152
Tudor Avatar answered Sep 24 '22 23:09

Tudor


I do neither think that there is a real way to do this, but one thing which comes to my mind is that you could use a virtual machine to simulate more cores. In VirtualBox for example you can select up to 16 cores out of the standard menu, but I am very confident that there are some hacks, which can make more of that and other VirtualMachines like VMware might even support more out of the Box.

enter image description here

like image 28
Stephan Dollberg Avatar answered Sep 25 '22 23:09

Stephan Dollberg


bamboon and and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.

Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (n = 10k/thread) and varying the number of threads available for computation.

It's true there's a lot of variables at work in any program -- however if you have some basic input size n, it's possible to get some semblance of scaling. On a n-body simulator I developed a few years back, I varied the threads for fixed size and the input size per thread and was able to reasonably calculate a rough measure of how well the multithreaded code scaled.

Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to largely threaded loads. But this may not be an issue if your application is only used on machines where there are small core counts.

You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.


Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation though, this won't be the case

like image 33
Mike Bailey Avatar answered Sep 23 '22 23:09

Mike Bailey