What would be the best way to measure the speedup of my program assuming I only have 4 cores? Obviously I could measure it up to 4, however it would be nice to know for 8, 16, and so on. Ideally I'd like to know the amount of speedup per number of thread, similar to this graph: <img src="https://i.stack.imgur.com/HTTmC.png" alt="Amdahl's law diagram"> Is there any way I can do this? Perhaps a method of simulating multiple cores?

I'm sorry, but in my opinion, the only reliable measurement is to actually get an 8, 16 or more cores machine and test on that. Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x. You could try to predict what will happen, but there are a lot of factors that need to be taken into account: <ol> <li>caches - size, number of layers, shared / non-shared</li> <li>memory bandwidth</li> <li>number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine</li> <li>interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.</li> <li>memory access - again, a lower number of cores work well with the SMP (symmetrical multiprocessing) model, while a higher number of core need a NUMA (non-uniform memory access) model.</li> </ol>

bamboon and and doron are correct that many variables are at play, but if you have a tunable input size <code>n</code>, you can figure out the strong scaling and weak scaling of your code. Strong scaling refers to fixing the problem size (e.g. <code>n = 1M</code>) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (<code>n = 10k/thread</code>) and varying the number of threads available for computation. It's true there's a lot of variables at work in any program -- however if you have some basic input size <code>n</code>, it's possible to get some semblance of scaling. On a n-body simulator I developed a few years back, I varied the threads for fixed size and the input size per thread and was able to reasonably calculate a rough measure of how well the multithreaded code scaled. Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to largely threaded loads. But this may not be an issue if your application is only used on machines where there are small core counts. You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available. <hr> Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation though, this won't be the case

How can I measure how my multithreaded code scales (speedup)?

3 Answers

I'm sorry, but in my opinion, the only reliable measurement is to actually get an 8, 16 or more cores machine and test on that.

Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.

You could try to predict what will happen, but there are a lot of factors that need to be taken into account:

caches - size, number of layers, shared / non-shared
memory bandwidth
number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
memory access - again, a lower number of cores work well with the SMP (symmetrical multiprocessing) model, while a higher number of core need a NUMA (non-uniform memory access) model.

152

answered Sep 24 '22 23:09

Tudor

I do neither think that there is a real way to do this, but one thing which comes to my mind is that you could use a virtual machine to simulate more cores. In VirtualBox for example you can select up to 16 cores out of the standard menu, but I am very confident that there are some hacks, which can make more of that and other VirtualMachines like VMware might even support more out of the Box.

enter image description here

answered Sep 25 '22 23:09

Stephan Dollberg

bamboon and and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.

Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (n = 10k/thread) and varying the number of threads available for computation.

It's true there's a lot of variables at work in any program -- however if you have some basic input size n, it's possible to get some semblance of scaling. On a n-body simulator I developed a few years back, I varied the threads for fixed size and the input size per thread and was able to reasonably calculate a rough measure of how well the multithreaded code scaled.

Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to largely threaded loads. But this may not be an issue if your application is only used on machines where there are small core counts.

You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.

Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation though, this won't be the case

answered Sep 23 '22 23:09

Mike Bailey

Related questions
                            
                                Vectorizing (SIMD) Tree operations
                            
                                Where can I find documentation on C++ memory alignment across different platforms/compilers?
                            
                                Embed HTML browser into a native C++/Win32 project using Visual Studio
                            
                                Can std::forward_list members be implemented as static?
                            
                                Why can extern be applied to definitions?
                            
                                can c/c++ do preemeptive multitasking in a single thread? [closed]
                            
                                c++ Dependency Injection: Object lifetimes?
                            
                                Disadvantages of sleeping bodies in Box2D
                            
                                Can I call CloseHandle() immediately after _beginthreadex() succeeded?
                            
                                How to efficiently build a Python dictionary in C++
                            
                                C++ Parsing Library with UTF-8 support
                            
                                error while using mutex
                            
                                Undefined reference to a static local variable
                            
                                Is this legal C/C++? `int* p = (int[]) {1,2,3} ;`
                            
                                How do you properly link to a static library using g++
                            
                                How to write qHash for a QSet<SomeClass*> container?
                            
                                "module could not be found" error with C++/CLI component in ASP.NET
                            
                                Python + alglib + NumPy: how to avoid converting arrays to lists?
                            
                                Prevent expression templates binding to rvalue references
                            
                                how to read image files and store it in memory(std::string) in c++?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How can I measure how my multithreaded code scales (speedup)?

Tags:

c++

multithreading

profiling

parallel-processing

Pubby

People also ask

3 Answers

Tudor

Stephan Dollberg

Mike Bailey

Recent Activity

Donate For Us