Poor performance due to hyper-threading with OpenMP: how to bind threads to cores

I am developing large dense matrix multiplication code. When I profile the code it sometimes reaches about 75% of the peak FLOPS of my four-core system and other times only about 36%. Within a single run the efficiency does not change: it either starts at 75% and stays there, or starts at 36% and stays there.

I have traced the problem down to hyper-threading and the fact that I set the number of threads to four instead of the default eight. When I disable hyper-threading in the BIOS I get about 75% efficiency consistently (or at least I never see the drastic drop to 36%).

Before I call any parallel code I do omp_set_num_threads(4). I have also tried export OMP_NUM_THREADS=4 before I run my code but it seems to be equivalent.

I don't want to disable hyper-threading in the BIOS. I think I need to bind the four threads to the four cores. I have tested some different cases of GOMP_CPU_AFFINITY, but so far I still have the problem that the efficiency is sometimes 36%. What is the mapping between hyper-threads and cores? E.g. do thread 0 and thread 1 correspond to the same core, and thread 2 and thread 3 to another core?
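
On Linux the mapping can be read from sysfs rather than guessed, since the numbering of logical CPUs depends on the firmware and kernel. A minimal sketch, assuming the usual sysfs layout (`/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`); the function names are made up for illustration:

```c
#include <stdio.h>
#include <string.h>

/* Read the hyper-thread siblings of logical CPU `cpu` from Linux sysfs.
 * On success copies a string such as "0,4" or "0-1" into buf and returns 0.
 * Returns -1 if the file cannot be read (CPU missing/offline, or not Linux). */
int read_thread_siblings(int cpu, char *buf, size_t len)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
             cpu);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (!ok)
        return -1;

    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}

/* Print the sibling list of cpu0, cpu1, ... and stop at the first
 * logical CPU whose sysfs entry is unreadable. */
void print_cpu_topology(void)
{
    char siblings[64];
    for (int cpu = 0;
         read_thread_siblings(cpu, siblings, sizeof siblings) == 0;
         ++cpu)
        printf("cpu%d shares a core with logical CPUs: %s\n", cpu, siblings);
}
```

Calling `print_cpu_topology()` from `main` lists, for each logical CPU, which logical CPUs sit on the same physical core. Two logical CPUs that appear in each other's list are hyper-threads of one core, so a four-thread run should use one CPU from each distinct list.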

How can I bind the threads to each core without thread migration so that I don't have to disable hyper-threading in the BIOS? Maybe I need to look into using sched_setaffinity?
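
For reference, pinning with `sched_setaffinity` could look roughly like the following. This is a sketch under assumptions, not a tested fix for this problem: `pin_current_thread`, `first_allowed_cpu` and `pin_openmp_threads` are hypothetical helper names, and the `cpus` array (one logical CPU per physical core) has to be supplied by the caller.

```c
#define _GNU_SOURCE   /* needed for sched_setaffinity and the CPU_* macros */
#include <sched.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Pin the calling thread to one logical CPU.
 * Returns 0 on success, -1 on error (errno is set). */
int pin_current_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);  /* 0 = calling thread */
}

/* First logical CPU the calling thread is allowed to run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Pin every OpenMP thread of the team to cpus[tid % ncpus].
 * Assumption: cpus[] holds one logical CPU per physical core. */
void pin_openmp_threads(const int *cpus, int ncpus)
{
    #pragma omp parallel
    {
#ifdef _OPENMP
        int tid = omp_get_thread_num();
#else
        int tid = 0;   /* compiled without OpenMP: single thread */
#endif
        if (pin_current_thread(cpus[tid % ncpus]) != 0)
            perror("sched_setaffinity");
    }
}
```

Something like `int cpus[] = {0, 1, 2, 3}; pin_openmp_threads(cpus, 4);` would be called once after `omp_set_num_threads(4)` and before the first compute region. This relies on the OpenMP runtime reusing the same worker threads in later parallel regions, which libgomp normally does but the OpenMP specification does not promise.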

Some details of my current system: Linux kernel 3.13, GCC 4.8, Intel Xeon E5-1620 (four physical cores, eight hyper-threads).

Edit: This seems to be working well so far

export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"

or

export GOMP_CPU_AFFINITY="0-7"

Edit: This also seems to work well

export OMP_PROC_BIND=true

Edit: These options also work well (gemm is the name of my executable)

numactl -C 0,1,2,3 ./gemm

and

taskset -c 0,1,2,3 ./gemm

Z boson asked Jun 23 '14 14:06



1 Answer

This isn't a direct answer to your question, but it might be worth looking into: apparently, hyper-threading can cause your cache to thrash. Have you tried Valgrind to see what kind of issue is causing your problem? There might be a quick fix to be had from allocating some junk at the top of every thread's stack so that your threads don't end up kicking each other's cache lines out.

It looks like your CPU's cache is 4-way set associative, so it's not insane to think that, across 8 threads, you might end up with some really unfortunately aligned accesses. If your matrices are aligned on a multiple of the size of your cache, and if you had pairs of threads accessing areas a cache-multiple apart, any incidental read by a third thread would be enough to start causing conflict misses.

For a quick test: change your input matrices to a size that's not a multiple of your cache size (so they're no longer aligned on a boundary). If your problems disappear, there's a good chance that you're dealing with conflict misses.
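
A variation on that test keeps the matrix dimensions but pads each row so that consecutive rows no longer sit an exact cache-multiple apart. The sketch below is only an illustration: `padded_ld` is a made-up helper, and the 4096-byte critical stride and 64-byte line size used in the example are assumptions that should be replaced with the values for the actual cache.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return a leading dimension (elements per row, >= n) whose row stride in
 * bytes is not a multiple of critical_stride. If the natural stride is a
 * multiple, one cache line worth of elements is added as padding. */
size_t padded_ld(size_t n, size_t elem_size,
                 size_t critical_stride, size_t line_size)
{
    size_t ld = n;
    if ((ld * elem_size) % critical_stride == 0) {
        size_t pad = line_size / elem_size;
        ld += pad > 0 ? pad : 1;   /* always pad by at least one element */
    }
    return ld;
}

/* Allocate a zeroed rows x ld matrix of doubles; element (i, j) lives at
 * a[i * ld + j]. Returns NULL on allocation failure. */
double *alloc_padded_matrix(size_t rows, size_t ld)
{
    return calloc(rows * ld, sizeof(double));
}
```

For a 1024 x 1024 matrix of doubles with the assumed values, `padded_ld(1024, sizeof(double), 4096, 64)` gives 1032, so each row starts 8256 bytes after the previous one instead of 8192. If the 36% runs disappear with the padded layout, that points toward conflict misses.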

Patrick Collins answered Oct 10 '22 18:10