Edit 3: The images are links to the full-size versions. Sorry for the pictures-of-text, but the graphs would be hard to copy/paste into a text table.
I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3 -fPIC:
In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, in spite of the instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both executed n times. This happens every time I run the profiler, on both a Westmere Xeon and the Haswell laptop that I'm using right now (compiled with SSE, because that's what I'm targeting and learning at the moment).
What am I missing?
Ignore the poor concurrency; it's most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.
I believe this is not an example of micro-optimisation, since those three added together amount to a decent percentage of the total time, and I'm really interested in the possible cause of this behavior.
Edit: OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune...
Same profile, albeit with a slightly lower CPI this time.
HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.
The inputs for your first group come from your gather. That probably cache-misses a lot, and those costs aren't going to get charged to the SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so a store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group that you've highlighted.)
The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.
But in the second group, the time for the cache misses on a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation, with the loads having to wait for RAX to be ready.
It looks like there's a lot you could optimize here.
Instead of store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler was making a poor choice about which vector register to spill (if it ran out of registers).
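A minimal sketch of what I mean (untested; the function name, the step update, and the loop bounds are placeholders, the point is just keeping the value in a register):

```cpp
#include <xmmintrin.h>

// Keep the current point live in a register for the whole loop instead of
// storing it to a_pointf[] and reloading it every iteration.
static void walk_ray(__m128 start, __m128 step, int n)
{
    __m128 pointf = start;                  // stays hot across iterations
    for (int i = 0; i < n; ++i) {
        // ... use pointf directly in the vector math here ...
        pointf = _mm_add_ps(pointf, step);  // per-iteration update, no store/reload
    }
}
```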
Calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.
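Something like this, assuming the coordinates are non-negative so truncation and floor agree (that assumption is mine, not from your code):

```cpp
#include <emmintrin.h>  // SSE2 is enough for these three intrinsics

// The gather indices depend only on CVTTPS2DQ; the floored value for the lerp
// weights is recovered from vi, off the critical path of the address math.
static __m128i indices_from_point(__m128 vf, __m128 *frac_out)
{
    __m128i vi     = _mm_cvttps_epi32(vf);    // CVTTPS2DQ: truncate toward zero
    __m128  vfloor = _mm_cvtepi32_ps(vi);     // == floor(vf) for non-negative input
    *frac_out      = _mm_sub_ps(vf, vfloor);  // fractional part for the lerps
    return vi;
}
```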
Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it (that's a guaranteed store-forwarding failure; see Agner Fog's optimization guides and other links from the x86 tag wiki).
It might be worth it to partially vectorize the address math (calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget).
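Roughly like this (untested; volume_data and the exact index layout are assumptions on my part, since I'm guessing at how base_0 is computed):

```cpp
#include <smmintrin.h>  // SSE4.1: PMULDQ, PEXTRQ
#include <cstdint>

// vi holds (x, y, z, -) as 32-bit ints. PMULDQ multiplies lanes 0 and 2, so the
// y and z indices are shuffled into those lanes and multiplied by their strides.
static const std::int16_t *calc_base(const std::int16_t *volume_data, __m128i vi,
                                     int dim_x, int dim_y)
{
    __m128i yz      = _mm_shuffle_epi32(vi, _MM_SHUFFLE(3, 2, 0, 1)); // y -> lane 0, z -> lane 2
    __m128i strides = _mm_set_epi32(0, dim_x * dim_y, 0, dim_x);
    __m128i prods   = _mm_mul_epi32(yz, strides);   // PMULDQ: two 64-bit products
    long long y_off = _mm_cvtsi128_si64(prods);     // MOVQ
    long long z_off = _mm_extract_epi64(prods, 1);  // PEXTRQ
    return volume_data + _mm_cvtsi128_si32(vi) + y_off + z_off;
}
```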
Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that the pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0,0,0,0). Then expand the low four 16-bit elements into 32-bit integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS).
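Putting those pieces together, something like this (untested; memcpy is my way of dodging the aliasing problem while still getting plain 32-bit loads):

```cpp
#include <smmintrin.h>  // SSE4.1: PMOVSXWD
#include <cstdint>
#include <cstring>

// Manual 2x2 gather of int16 voxels: two 32-bit loads, widen, convert to float.
// memcpy avoids the aliasing UB and still compiles to plain MOVD/PINSRD loads.
static __m128 load_voxel_quad(const std::int16_t *base_0, int dim_x)
{
    int lo, hi;
    std::memcpy(&lo, base_0, sizeof(lo));          // two adjacent shorts
    std::memcpy(&hi, base_0 + dim_x, sizeof(hi));  // the pair one row over
    __m128i packed = _mm_setr_epi32(lo, hi, 0, 0);
    __m128i wide   = _mm_cvtepi16_epi32(packed);   // PMOVSXWD: 4 x int16 -> 4 x int32
    return _mm_cvtepi32_ps(wide);                  // CVTDQ2PS: all four to float at once
}
```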
Your scalar LERPs aren't being auto-vectorized, but you could do two in parallel (and maybe save an instruction, since you want the result in a vector anyway).
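For example (a sketch, with the endpoint pairs and weights assumed to already be packed into __m128s):

```cpp
#include <xmmintrin.h>

// Lane-wise lerp: a + t*(b - a). With two (or four) endpoint pairs packed into
// a and b and the weights in t, one SUB/MUL/ADD sequence does them all at once.
static inline __m128 lerp4(__m128 a, __m128 b, __m128 t)
{
    return _mm_add_ps(a, _mm_mul_ps(t, _mm_sub_ps(b, a)));
}
```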
Calling floorf() is silly, and a function call forces the compiler to spill all xmm registers to memory. Compile with -ffast-math or whatever to let it inline to a ROUNDSS, or do that manually. Especially since you go ahead and load the float that you calculate from that into a vector!
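The manual version is just the SSE4.1 intrinsic (a sketch; the wrapper name is made up):

```cpp
#include <smmintrin.h>  // SSE4.1: ROUNDSS

// Inline floor of a scalar float without a libm call, so no xmm spills around it.
static inline float floor_ss(float x)
{
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_floor_ss(v, v));  // ROUNDSS, round toward -inf
}
```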
Use a vector compare instead of scalar prev_x / prev_y / prev_z. Use MOVMSKPS to get the result into an integer you can test. (You only care about the lower 3 elements, so test it with compare_mask & 0b0111: true if any of the low 3 bits of the 4-bit mask are set, after a compare for not-equal with _mm_cmpneq_ps. See the double version of the instruction for more tables on how it all works: http://www.felixcloutier.com/x86/CMPPD.html.)
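In code, something like this (a sketch; cur and prev are assumed to hold the current and previous x/y/z in the low three lanes):

```cpp
#include <xmmintrin.h>

// One vector compare + MOVMSKPS instead of three scalar compares.
static inline bool position_changed(__m128 cur, __m128 prev)
{
    __m128 neq = _mm_cmpneq_ps(cur, prev);  // CMPPS, predicate "not equal"
    int mask   = _mm_movemask_ps(neq);      // MOVMSKPS: one bit per lane
    return (mask & 0b0111) != 0;            // only the low 3 lanes matter
}
```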
Well, when analyzing assembly code, please note that running time is attributed to the next instruction, so the per-instruction data you're looking at needs to be interpreted carefully. There is a corresponding note in the VTune Release Notes:
Running time is attributed to the next instruction (200108041)
To collect the data about time-consuming running regions of the target, the Intel® VTune™ Amplifier interrupts executing target threads and attributes the time to the context IP address.
Due to the collection mechanism, the captured IP address points to an instruction AFTER the one that is actually consuming most of the time. This leads to the running time being attributed to the next instruction (or, rarely to one of the subsequent instructions) in the Assembly view. In rare cases, this can also lead to wrong attribution of running time in the source - the time may be erroneously attributed to the source line AFTER the actual hot line.
In case the inline mode is ON and the program has small functions inlined at the hotspots, this can cause the running time to be attributed to a wrong function since the next instruction can belong to a different function in tightly inlined code.