 

Will multithreading improve performance significantly if I have a fixed number of calculations that are independent of each other?

I am programming a raycasting game engine. Each ray can be calculated without knowing anything about the other rays (I'm only calculating distances). Since there is no waiting time between calculations, I wonder whether it's worth the effort to make the ray calculations multithreaded or not. Is it likely that there will be a performance boost?

asked Aug 18 '14 by Marius Anderie


2 Answers

Most likely, multi-threading will improve performance if done correctly. The way you've stated your problem, it is a perfect candidate for multi-threading, since the computations are independent, reducing the need for coordination between threads to a minimum.

Some reasons you still might not get a speedup, or may not get the full speedup you expect, include:

1) The bottleneck may not be on-die CPU execution resources (e.g., ALU throughput), but rather a shared resource such as main memory or last-level cache (LLC) bandwidth.

For example, on some architectures a single thread may be able to saturate memory bandwidth, so adding more cores may not help at all. A more common case is that a single core can saturate some fraction 1/N of main memory bandwidth, where N is smaller than the core count C. In that case, for a memory-bound computation, you get good scaling only up to N cores. For instance, on a 4-core box, one core may be able to consume 50% of the bandwidth; then you'll get good scaling to 2 cores (using 100% of bandwidth), but little to none above that.

Other resources which are shared among cores include disk and network IO, GPU, snoop bandwidth, etc. If you have a hyper-threaded platform, this list increases to include all levels of cache and ALU resources for logical cores sharing the same physical core.

2) Contention "in practice" between operations which are "theoretically" independent.

You mention that your operations are independent. Typically this means that they are logically independent - they don't share any data (other than perhaps immutable input) and they can write to separate output areas. That doesn't exclude the possibility, however, that any given implementation actually has some hidden sharing going on.

One classic example is false-sharing - where independent variables fall in the same cache line, so logically independent writes to different variables from different threads end up thrashing the cache line between cores.

Another example, frequently encountered in practice, is contention via library calls - if your routines use malloc heavily, you may find that all the threads spend most of their time waiting on a lock inside the allocator, since malloc is a shared resource. This can be remedied by reducing reliance on malloc (perhaps via fewer, larger mallocs) or with a good concurrent allocator such as Hoard or tcmalloc.

3) The implementation of distributing and collecting the computation across threads may overwhelm the advantage you get from multiple threads. For example, if you spin up a new thread for every individual ray, the thread-creation overhead would dominate your runtime and you would likely see a negative benefit. Even if you use a thread pool of persistent threads, choosing a "work unit" that is too fine-grained will impose a lot of coordination overhead, which may eliminate your benefits.

Similarly, if you have to copy the input data to and from the worker threads, you may not see the scaling you expect. Where possible, use pass-by-reference for read-only data.

4) You don't have more than 1 core, or you do have more than 1 core but they are already occupied running other threads or processes. In these cases, the effort to coordinate multiple threads is pure overhead.

answered Oct 21 '22 by BeeOnRope


In general, it depends. Given that the calculations are independent, it sounds like this is a good candidate for potential performance improvements due to threading. Ray calculations typically can benefit from this.

However, there are many other factors, such as memory access patterns and the underlying system on which this runs, which will have a tremendous impact on this. It's quite possible for a multithreaded version to run slower than the single-threaded version if not written correctly, so profiling is the only way to answer this definitively.

answered Oct 21 '22 by Reed Copsey