I wanted to run some code through the IACA analyzer to see how many uops it uses, so I started with a simple function to check that it was working.
Unfortunately, when I insert the macros IACA says to use, the resulting assembly is very different, so any analysis of it is not helpful.
Here is the assembly produced without IACA:
00007FF9CD590580 vaddps ymm1,ymm5,ymmword ptr [rax]
00007FF9CD590584 vaddps ymm2,ymm6,ymmword ptr [rax+20h]
00007FF9CD590589 vaddps ymm3,ymm7,ymmword ptr [rax+40h]
00007FF9CD59058E vmulps ymm4,ymm1,ymm1
00007FF9CD590592 vfmadd231ps ymm4,ymm2,ymm2
00007FF9CD590597 vfmadd231ps ymm4,ymm3,ymm3
00007FF9CD59059C vcmpgt_oqps ymm1,ymm4,ymm9
00007FF9CD5905A2 vrsqrtps ymm0,ymm4
00007FF9CD5905A6 vandps ymm2,ymm1,ymm0
00007FF9CD5905AA vmovups ymm3,ymm8
00007FF9CD5905AF vfmsub231ps ymm3,ymm2,ymm4
00007FF9CD5905B4 vmovups ymmword ptr [r9+rax],ymm3
00007FF9CD5905BA add rax,rcx
00007FF9CD5905BD sub r8d,1
00007FF9CD5905C1 jne fm::EvlOp::applyLoop<`RegisterShapeOps<fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> > >'::`2'::doDISTANCE_SPHERE_11,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::DataWrapper,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::RegisterBlock,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::instruction_input>+0B0h (07FF9CD590580h)
And here is what it produces once I add the IACA macros (I'm testing an MSVC-produced binary, so I'm using IACA_VC64_START and IACA_VC64_END, as the manual says to do):
00007FF9CD59058B vmovups ymm2,ymmword ptr [rax+40h]
00007FF9CD590590 vmovups ymm0,ymmword ptr [rax]
00007FF9CD590594 vmovups ymm1,ymmword ptr [rax+20h]
00007FF9CD590599 vaddps ymm3,ymm2,ymm8
00007FF9CD59059E vmovups ymmword ptr [rbp+20h],ymm0
00007FF9CD5905A3 vaddps ymm0,ymm0,ymm6
00007FF9CD5905A7 vmovups ymmword ptr [rbp+40h],ymm1
00007FF9CD5905AC vmulps ymm4,ymm0,ymm0
00007FF9CD5905B0 vaddps ymm1,ymm1,ymm7
00007FF9CD5905B4 vfmadd231ps ymm4,ymm1,ymm1
00007FF9CD5905B9 vfmadd231ps ymm4,ymm3,ymm3
00007FF9CD5905BE vcmpgt_oqps ymm1,ymm4,ymm5
00007FF9CD5905C3 vrsqrtps ymm0,ymm4
00007FF9CD5905C7 vmovups ymmword ptr [rbp+60h],ymm2
00007FF9CD5905CC vandps ymm2,ymm1,ymm0
00007FF9CD5905D0 vmovups ymm3,ymm9
00007FF9CD5905D5 vfmsub231ps ymm3,ymm2,ymm4
00007FF9CD5905DA vmovups ymmword ptr [rcx+rax],ymm3
00007FF9CD5905DF add rax,rdx
00007FF9CD5905E2 mov qword ptr [rbp+18h],rax
00007FF9CD5905E6 vmovups ymmword ptr [rbp+80h],ymm3
00007FF9CD5905EE sub r8d,1
00007FF9CD5905F2 jne fm::EvlOp::applyLoop<`RegisterShapeOps<fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> > >'::`2'::doDISTANCE_SPHERE_11,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::DataWrapper,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::RegisterBlock,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::instruction_input>+0B2h (07FF9CD590582h)
So it has inserted lots of moves, and now my (hopefully) fused add is no longer fused.
I was hoping it would be able to tell me if
00007FF9CD590584 vaddps ymm2,ymm6,ymmword ptr [rax+20h]
stayed fused, but it removed this code altogether.
Is this a known issue, or perhaps because I'm using MSVC which may not be very common?
Is there perhaps a way to fix this, or a better tool that is compatible with MSVC?
IACA mark macros are just inline asm (or for 64-bit MSVC: start = __writegsbyte(111, 111); and stop = 222). They can potentially disturb the optimizer, or end up in the wrong place (e.g. not the last instruction before falling into a loop, so the block includes some loop setup).
If that happens, like in your case, your best bet is to ask the compiler to produce asm (not machine code) output, and manually insert the markers into the asm you want to analyze.
In NASM syntax, I use this %if / %else block so I can build with nasm -DIACA_MARKS or not. I know this isn't the right syntax for MASM, but the IACA start/end markers are pretty simple: mov to EBX and fs addr32 nop.
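%ifdef IACA_MARKS
%macro IACA_start 0        ; NASM macro with 0 args, defines IACA_start
    mov ebx, 111
    db 0x64, 0x67, 0x90    ; fs addr32 nop
%endmacro
%macro IACA_end 0          ; NASM macro with 0 args, defines IACA_end
    mov ebx, 222
    db 0x64, 0x67, 0x90    ; fs addr32 nop
%endmacro
%else
%define IACA_start         ; expands to nothing when IACA_MARKS is not defined
%define IACA_end
%endif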
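As a rough sketch of what manual insertion could look like, here is the loop from the question transcribed by hand into NASM syntax and wrapped in the markers. The file name loop.asm, the symbol distance_loop and the label .loop are made up for illustration, and the register setup before the loop is omitted, so this is only meant to be assembled and analyzed statically, not called.

```nasm
; loop.asm (hypothetical file name): the question's loop with IACA markers.
; Build with marks:    nasm -f win64 -DIACA_MARKS loop.asm -o loop.obj
; Build without marks: nasm -f win64 loop.asm -o loop.obj

%ifdef IACA_MARKS
%macro IACA_start 0
    mov ebx, 111
    db 0x64, 0x67, 0x90        ; fs addr32 nop
%endmacro
%macro IACA_end 0
    mov ebx, 222
    db 0x64, 0x67, 0x90        ; fs addr32 nop
%endmacro
%else
%define IACA_start
%define IACA_end
%endif

section .text
global distance_loop           ; hypothetical symbol name

distance_loop:
    ; Loop setup omitted: assumes rax, rcx, r8d, r9 and ymm5-ymm9 hold
    ; whatever the compiler-generated code put there.
    IACA_start                 ; last thing before falling into the loop
.loop:
    vaddps      ymm1, ymm5, [rax]
    vaddps      ymm2, ymm6, [rax+0x20]
    vaddps      ymm3, ymm7, [rax+0x40]
    vmulps      ymm4, ymm1, ymm1
    vfmadd231ps ymm4, ymm2, ymm2
    vfmadd231ps ymm4, ymm3, ymm3
    vcmpgt_oqps ymm1, ymm4, ymm9
    vrsqrtps    ymm0, ymm4
    vandps      ymm2, ymm1, ymm0
    vmovups     ymm3, ymm8
    vfmsub231ps ymm3, ymm2, ymm4
    vmovups     [r9+rax], ymm3
    add         rax, rcx
    sub         r8d, 1
    jnz         .loop
    IACA_end                   ; right after the loop branch
    ret
```

Note that the markers overwrite EBX, so a build with -DIACA_MARKS is for feeding the object file to IACA only. Placing the start marker immediately before the loop label keeps the loop setup out of the analyzed block.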