Hyper-threading... made my renderer 10 times slower

Executive summary: How can one specify in code that OpenMP should only spawn threads for the REAL cores, i.e. not count the hyper-threading ones?

Detailed analysis: Over the years, I've coded a SW-only, open source renderer (rasterizer/raytracer) in my free time. The GPL code and Windows binaries are available at https://www.thanassis.space/renderer.html, and it compiles and runs fine under Windows, Linux, OS/X and the BSDs.

I introduced a raytracing mode this last month - and the quality of the generated pictures sky-rocketed. Unfortunately, raytracing is orders of magnitude slower than rasterizing. To increase speed, just as I did for the rasterizers, I added OpenMP (and TBB) support to the raytracer - to easily make use of additional CPU cores. Both rasterizing and raytracing are easily amenable to threading (work per triangle - work per pixel).

At home, with my Core2Duo, the 2nd core helped all the modes - both the rasterizing and the raytracing modes got a speedup that is between 1.85x and 1.9x.

The problem: Naturally, I was curious to see the top CPU performance (I also "play" with GPUs, with a preliminary CUDA port), so I wanted a solid base for comparisons. I gave the code to a good friend of mine, who has access to a "beast" machine with a 16-core, $1500 Intel super processor.

He runs it in the "heaviest" mode, the raytracer mode...

...and he gets one fifth the speed of my Core2Duo (!)

Gasp - horror. What just happened?

We started trying different modifications, patches, ... and eventually we figured it out.

By using the OMP_NUM_THREADS environment variable, one can control how many OpenMP threads are spawned. As the number of threads was increasing from 1 up to 8, the speed was increasing (close to a linear increase). The moment we crossed 8, speed started to diminish, until it nose-dived to one fifth the speed of my Core2Duo, when all 16 cores were used!
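
The same sweep can also be driven from inside the program instead of re-exporting OMP_NUM_THREADS for every run. What follows is only a minimal sketch, not the renderer's code: `busy_work` is a hypothetical stand-in for the per-pixel work, and `time_with_threads` and `sweep` are illustrative helpers. It assumes the compiler has OpenMP enabled (e.g. `-fopenmp` with GCC); without it the pragma is ignored and everything runs on one thread.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the per-pixel work of a renderer.
// Integer arithmetic keeps the checksum identical for any thread count.
std::uint64_t busy_work(int pixels) {
    std::uint64_t sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < pixels; ++i) {
        std::uint64_t x = static_cast<std::uint64_t>(i);
        sum += (x * x) % 7;
    }
    return sum;
}

// Runs busy_work with the requested thread count and returns elapsed seconds.
double time_with_threads(int threads, int pixels, std::uint64_t* result) {
#ifdef _OPENMP
    // Overrides OMP_NUM_THREADS for the parallel regions that follow.
    omp_set_num_threads(threads);
#else
    (void)threads;  // no OpenMP: the loop runs serially
#endif
    auto start = std::chrono::steady_clock::now();
    *result = busy_work(pixels);
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Prints one timing line per thread count from 1 to max_threads.
void sweep(int max_threads, int pixels) {
    for (int t = 1; t <= max_threads; ++t) {
        std::uint64_t result = 0;
        double secs = time_with_threads(t, pixels, &result);
        std::printf("%2d threads: %.4f s (checksum %llu)\n", t, secs,
                    static_cast<unsigned long long>(result));
    }
}
```

Calling `sweep(16, 50000000)` on a machine like the one described would print one line per thread count, which makes the point where the timings stop improving easy to spot. The workload here is compute-bound, so it may well not reproduce the slowdown seen with the real raytracer.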

Why 8?

Because 8 was the number of the real cores. The other 8 were... hyperthreading ones!

The theory: Now, this was news to me - I've seen hyper-threading help a lot (up to 25%) in other algorithms, so this was unexpected. Apparently, even though each hyper-threading core comes with its own registers (and SSE unit?), the raytracer could not make use of the extra processing power. Which led me to think...

It is probably not processing power that is starved - it is memory bandwidth.

The raytracer uses a bounding volume hierarchy data structure to accelerate ray-triangle intersections. If the hyperthreaded cores are used, then each of the "logical cores" in a pair is trying to read from different places in that data structure (i.e. in memory) - and the CPU caches (local per pair) are completely thrashed. At least, that's my theory - any suggestions are most welcome.

So, the question: OpenMP detects the number of "cores" and spawns threads to match it - that is, it includes the hyperthreaded "cores" in the calculation. In my case, this apparently leads to disastrous results, speed-wise. Does anyone know how to use the OpenMP API (if possible, portably) to only spawn threads for the REAL cores, and not the hyperthreaded ones?

P.S. The code is open (GPL) and available at the link above; feel free to reproduce this on your own machine - I am guessing it will happen on all hyperthreaded CPUs.

P.P.S. Excuse the length of the post, I thought it was an educational experience and wanted to share.

asked Jan 27 '11 14:01 by ttsiodras


2 Answers

Basically, you need some fairly portable way of querying the environment for fairly low-level hardware details - and generally, you can't do that from just system calls (the OS is generally unaware even of the difference between hardware threads and cores).

One library that supports a number of platforms is hwloc: it runs on Linux and Windows (among others) and handles Intel and AMD chips. hwloc lets you find out everything about the hardware topology, and it knows the difference between cores and hardware threads (called PUs - processing units - in hwloc terminology). So you'd call this library at startup, find the number of actual cores, and call omp_set_num_threads() (or pass that value in a num_threads clause on each parallel region).
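
As a rough sketch of that approach (an illustration under stated assumptions, not a tested recipe): the hwloc calls used below are `hwloc_topology_init`, `hwloc_topology_load`, `hwloc_get_nbobjs_by_type` and `hwloc_topology_destroy`. The `USE_HWLOC` macro and the helper names `physical_core_count`, `pick_thread_count` and `configure_openmp_threads` are made up for this example; the macro is assumed to be set by the build only when hwloc is installed, so the code still compiles without it.

```cpp
#include <thread>
#ifdef USE_HWLOC   // assumption: defined by the build, e.g. -DUSE_HWLOC -lhwloc
#include <hwloc.h>
#endif
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of physical cores reported by hwloc, or 0 if unknown
// (hwloc not compiled in, or topology discovery failed).
int physical_core_count() {
#ifdef USE_HWLOC
    hwloc_topology_t topology;
    if (hwloc_topology_init(&topology) != 0) return 0;
    int cores = 0;
    if (hwloc_topology_load(topology) == 0) {
        // Counts core objects, not PUs (hardware threads).
        cores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
    }
    hwloc_topology_destroy(topology);
    return cores > 0 ? cores : 0;
#else
    return 0;
#endif
}

// Hypothetical helper: prefer the physical count, fall back to logical CPUs.
int pick_thread_count(int physical, unsigned logical) {
    if (physical > 0) return physical;
    return logical > 0 ? static_cast<int>(logical) : 1;
}

// Call once at startup, before the first parallel region.
int configure_openmp_threads() {
    int n = pick_thread_count(physical_core_count(),
                              std::thread::hardware_concurrency());
#ifdef _OPENMP
    omp_set_num_threads(n);
#endif
    return n;
}
```

With GCC this would be built with something like `g++ -fopenmp -DUSE_HWLOC main.cpp -lhwloc`. Note that this only limits how many threads are spawned; it does not by itself stop the OS from placing two of them on the same physical core.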

answered Oct 06 '22 00:10 by Jonathan Dursi


Unfortunately, your assumption about why this is occurring is most likely correct. To be sure, you would have to use a profiling tool - but I have seen this before with raytracing, so it is not surprising. In any case, there is currently no way to determine from OpenMP that some of the processors are "real" and some are hyperthreaded. You could write some code to determine this and then set the number yourself. However, there would still be the problem that OpenMP doesn't schedule the threads on the processors itself - it allows the OS to do that.

There has been work in the OpenMP ARB language committee to try and define a standard way for users to query their environment and say how to run. At this time, this discussion is still raging on. Many implementations allow you to "bind" the threads to the processors, by use of an implementation-defined environment variable. However, the user has to know the processor numbering and which processors are "real" vs. hyperthreaded.
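
To illustrate the "know the processor numbering" part, here is a small sketch that picks one logical CPU per physical core. It assumes Linux, where each /sys/devices/system/cpu/cpuN/topology/thread_siblings_list file lists the logical CPUs sharing a core (e.g. "0,8" or "2-3"), and it assumes those lists are in ascending order so that the leading number identifies the core. The function name `one_cpu_per_core` is invented for the example, and reading the files themselves is left out.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Input: one string per logical CPU, holding the contents of its
// thread_siblings_list file (e.g. "0,8" or "2-3").
// Output: one logical CPU id per physical core, space-separated.
std::string one_cpu_per_core(const std::vector<std::string>& sibling_lists) {
    std::set<int> representatives;
    for (const std::string& list : sibling_lists) {
        // Skip empty or malformed entries instead of letting stoi throw.
        if (list.empty() ||
            !std::isdigit(static_cast<unsigned char>(list[0]))) {
            continue;
        }
        // Siblings report the same list, so its leading id is shared by
        // every hardware thread of that core; the set removes duplicates.
        representatives.insert(std::stoi(list));
    }
    std::ostringstream out;
    bool first = true;
    for (int cpu : representatives) {
        if (!first) out << ' ';
        out << cpu;
        first = false;
    }
    return out.str();
}
```

The resulting list could be exported before launch through a runtime-specific variable, for example `GOMP_CPU_AFFINITY` for GCC's libgomp, together with `OMP_NUM_THREADS` set to the number of entries; other runtimes use their own variables (Intel's uses `KMP_AFFINITY`). OpenMP 4.0 and later define `OMP_PLACES` (with the abstract name `cores`) and `OMP_PROC_BIND` for this purpose, which may make a hand-built list unnecessary where the runtime supports them.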

answered Oct 05 '22 23:10 by ejd