Background
I have an EP (embarrassingly parallel) C application running four threads on my laptop, which has an Intel i5 M 480 running at 2.67GHz. This CPU has two hyperthreaded cores.
The four threads execute the same code on different subsets of data. The code and data have no problems fitting in a few cache lines (fit entirely in L1 with room to spare). The code contains no divisions, is essentially CPU-bound, uses all available registers and does a few memory accesses (outside L1) to write results on completion of the sequence.
The compiler is mingw64 4.8.1, i.e. fairly recent. The best basic optimization level appears to be -O1, which results in four threads that complete faster than two. -O2 and higher run slower (two threads complete faster than four, but slower than -O1), as does -Os. Every thread on average does 3.37 million sequences every second, which comes out to about 780 clock cycles for each. On average every sequence performs 25.5 sub-operations, or one per 30.6 cycles.
So what two hyperthreads do in parallel in 30.6 cycles (one sub-operation each), one thread will do sequentially in 35-40 cycles, or 17.5-20 cycles per sub-operation.
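The arithmetic behind these figures can be sketched in a few lines of C. The function names are made up for illustration, and the inputs are the rounded numbers quoted above. With those rounded inputs the first division comes out slightly above the "about 780" cycles stated, which is close enough for the argument.

```c
#include <assert.h>

/* Cycles one thread spends per sequence: clock rate / sequences per second. */
double cycles_per_sequence(double clock_hz, double sequences_per_sec)
{
    assert(sequences_per_sec > 0.0);
    return clock_hz / sequences_per_sec;
}

/* Cycles per sub-operation, given the average sub-operations per sequence. */
double cycles_per_subop(double cycles_per_seq, double subops_per_seq)
{
    assert(subops_per_seq > 0.0);
    return cycles_per_seq / subops_per_seq;
}

/* Throughput gain of two hyperthreads sharing a core over one thread alone.
   The pair completes two sub-operations in ht_cycles; the lone thread
   needs solo_cycles_for_two for the same two sub-operations. */
double ht_gain(double solo_cycles_for_two, double ht_cycles)
{
    assert(ht_cycles > 0.0);
    return solo_cycles_for_two / ht_cycles;
}
```

Calling ht_gain(35.0, 30.6) and ht_gain(40.0, 30.6) brackets the benefit of the second hyperthread at somewhere between roughly 1.14x and 1.31x per core under these numbers.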
Where I am
I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources.
These switches work fairly well (when compiling module by module):
-O1 -m64 -mthreads -g -Wall -c -fschedule-insns
as do these when compiling one module which #includes all the others:
-O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program
There is no discernible performance difference between the two.
Question
Has anyone experimented with this and achieved good results?
You say: "I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources." That's rather misguided.
Your CPU has a certain amount of resources. Code will be able to use some of the resources, but usually not all. Hyperthreading means you have two threads capable of using the resources, so a higher percentage of these resources will be used.
What you want is to maximise the percentage of resources that are used. Efficient code will use these resources more efficiently in the first place, and adding hyperthreading can only help. You won't get that much of a speedup through hyperthreading, but that is because you already got the speedup in the single-threaded code by making it more efficient. If you want bragging rights that hyperthreading gave you a big speedup, sure, start with inefficient code. If you want maximum speed, start with efficient code.
Now if your code was limited by latencies, it means it could perform quite a few useless instructions without penalty. With hyperthreading, these useless instructions actually have a cost, because they take execution resources that the other thread could have used. So for hyperthreading, you want to minimise the number of instructions, especially those that were hidden by latencies and had no visible cost in single-threaded code.
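To make "limited by latencies" concrete, here is a minimal sketch (not the asker's code; all names and constants are invented for illustration). In one_chain every step needs the result of the previous one, so the core mostly waits on the multiply latency. In two_chains a second, independent stream of work is interleaved with the first, which is roughly the kind of slack a second hyperthread can fill in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a multiply-add recurrence; the constants are arbitrary. */
uint64_t lcg_step(uint64_t x)
{
    return x * 6364136223846793005ULL + 1442695040888963407ULL;
}

/* Latency-bound: each iteration depends on the previous result. */
uint64_t one_chain(uint64_t seed, int n)
{
    assert(n >= 0);
    uint64_t x = seed;
    for (int i = 0; i < n; i++)
        x = lcg_step(x);
    return x;
}

/* Two independent dependency chains interleaved in one loop.
   Neither chain waits on the other, so their steps can overlap. */
void two_chains(uint64_t seed_a, uint64_t seed_b, int n,
                uint64_t *out_a, uint64_t *out_b)
{
    assert(n >= 0 && out_a != 0 && out_b != 0);
    uint64_t a = seed_a, b = seed_b;
    for (int i = 0; i < n; i++) {
        a = lcg_step(a);
        b = lcg_step(b);
    }
    *out_a = a;
    *out_b = b;
}
```

Any extra instruction added to one_chain outside the dependency chain is nearly free when the thread runs alone, but it competes for issue slots once a sibling hyperthread is running on the same core.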
You could try locking each thread to a core using processor affinity. I've heard this can give 15%-50% improved efficiency with some code. The saving is that when a context switch happens, less of the cache contents changes. This will work better on a machine that is running only your app.
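Since the question uses mingw64, pinning could look roughly like the sketch below, which uses the Win32 calls SetThreadAffinityMask and GetCurrentThread. The helper names single_cpu_mask and pin_current_thread are hypothetical, and the mapping of logical CPU numbers to physical cores is an assumption you would have to check on your machine. On non-Windows builds the sketch simply reports failure.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Bit mask selecting a single logical CPU (bit i = logical CPU i). */
uint64_t single_cpu_mask(unsigned logical_cpu)
{
    assert(logical_cpu < 64);
    return (uint64_t)1 << logical_cpu;
}

/* Pin the calling thread to one logical CPU.
   Returns 1 on success, 0 on failure or on an unsupported platform. */
int pin_current_thread(unsigned logical_cpu)
{
#ifdef _WIN32
    DWORD_PTR mask = (DWORD_PTR)single_cpu_mask(logical_cpu);
    /* SetThreadAffinityMask returns the previous mask, or 0 on failure. */
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    (void)logical_cpu; /* not implemented outside Windows in this sketch */
    return 0;
#endif
}
```

Each of the four worker threads would call pin_current_thread with its own index (0 to 3) as the first thing it does, then run its sequences as before.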