I want to run the following code (in Intel syntax) in gcc (AT&T syntax).
; float a[128], b[128], c[128];
; for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
; Assume that a, b and c are aligned by 32
xor ecx, ecx ; Loop counter i = 0
L: vmovaps ymm0, [b+rcx] ; Load 8 elements from b
vaddps ymm0,ymm0,[c+rcx] ; Add 8 elements from c
vmovaps [a+rcx], ymm0 ; Store result in a
add ecx,32 ; 8 elements * 4 bytes = 32
cmp ecx, 512 ; 128 elements * 4 bytes = 512
jb L ;Loop
Code is from Optimizing subroutines in assembly language.
The code I've written so far is:
static inline void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
"nop \n"
"xor %%ecx, %%ecx \n" //;Loop counter set to 0
"loop: \n\t"
"vmovaps %1, %%ymm0 \n" //;Load 8 elements from b <== WRONG
"vaddps %2, %%ymm0, %%ymm0 \n" //;Add 8 elements from c <==WRONG
"vmovaps %%ymm0, %0 \n" //;Store result in a
"add 0x20, %%ecx \n" //;8 elemtns * 4 bytes = 32 (0x20)
"cmp 0x200,%%ecx \n" //;128 elements * 4 bytes = 512 (0x200)
"jb loop \n" //;Loop"
"nop \n"
: "=m"(a) //Outputs
: "m"(b), "m"(c) //Inputs
: "%ecx","%ymm0" //Modifies ECX and YMM0
);
}
The lines marked as "wrong" generate: (except from gdb disassemble)
0x0000000000000b78 <+19>: vmovaps -0x10(%rbp),%ymm0
0x0000000000000b7d <+24>: vaddps -0x18(%rbp),%ymm0,%ymm0
I want to get something like this (I guess):
vmovaps -0x10(%rbp,%ecx,%0x8),%ymm0
But I do not know how to specify %ecx as my index register.
Can you help me, please?
EDIT
I've tried (%1, %%ecx):
__asm__ __volatile__ (
"nop \n"
"xor %%ecx, %%ecx \n" //;Loop counter set to 0
"loop: \n\t"
"vmovaps (%1, %%rcx), %%ymm0 \n" //;Load 8 elements from b <== MODIFIED HERE
"vaddps %2, %%ymm0, %%ymm0 \n" //;Add 8 elements from c
"vmovaps %%ymm0, %0 \n" //;Store result in a
"add 0x20, %%ecx \n" //;8 elemtns * 4 bytes = 32 (0x20)
"cmp 0x200,%%ecx \n" //;128 elements * 4 bytes = 512 (0x200)
"jb loop \n" //;Loop"
"nop \n"
: "=m"(a) //Outputs
: "m"(b), "m"(c) //Inputs
: "%ecx","%ymm0" //Modifies ECX and YMM0
);
And I got:
inline1.cpp: Assembler messages:
inline1.cpp:90: Error: found '(', expected: ')'
inline1.cpp:90: Error: junk `(%rbp),%rcx)' after expression
I don't think it is possible to translate this literally into GAS inline assembly. In AT&T syntax, the syntax is:
displacement(base register, offset register, scalar multiplier)
which would produce something akin to:
movl -4(%ebp, %ecx, 4), %eax
or in your case:
vmovaps -16(%rsp, %ecx, 0), %ymm0
The problem is, when you use a memory constraint (m), the inline assembler is going to emit the following wherever you write %n (where n is the number of the input/output):
-16(%rsp)
There is no way to manipulate the above into the form you actually want. You can write:
(%1, %%rcx)
but this will produce:
(-16(%rsp),%rcx)
which is clearly wrong. There is no way to get the offset register inside of those parentheses, where it belongs, since %n is emitting the whole -16(%rsp) as a chunk.
Of course, this is not really an issue, since you write inline assembly to get speed, and there's nothing speedy about loading from memory. You should have the inputs in a register, and when you use a register constraint for the input/output (r), you don't have a problem. Notice that this will require modifying your code slightly
Other things wrong with your inline assembly include:
$.l for 32-bit and q for 64-bit.a, so you should have a memory clobber.nop instructions at the beginning and the end are completely pointless. They aren't even aligning the branch target.\t), in addition to a new-line (\n), so that you get proper alignment when you inspect the disassembly.Here is my version of the code:
void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
"xorl %%ecx, %%ecx \n\t" // Loop counter set to 0
"loop: \n\t"
"vmovaps (%1,%%rcx), %%ymm0 \n\t" // Load 8 elements from b
"vaddps (%2,%%rcx), %%ymm0, %%ymm0 \n\t" // Add 8 elements from c
"vmovaps %%ymm0, (%0,%%rcx) \n\t" // Store result in a
"addl $0x20, %%ecx \n\t" // 8 elemtns * 4 bytes = 32 (0x20)
"cmpl $0x200, %%ecx \n\t" // 128 elements * 4 bytes = 512 (0x200)
"jb loop" // Loop"
: // Outputs
: "r" (a), "r" (b), "r" (c) // Inputs
: "%ecx", "%ymm0", "memory" // Modifies ECX, YMM0, and memory
);
}
This causes the compiler to emit the following:
addArray(float*, float*, float*):
xorl %ecx, %ecx
loop:
vmovaps (%rsi,%rcx), %ymm0 # b
vaddps (%rdx,%rcx), %ymm0, %ymm0 # c
vmovaps %ymm0, (%rdi,%rcx) # a
addl $0x20, %ecx
cmpl $0x200, %ecx
jb loop
vzeroupper
retq
Or, in the more familiar Intel syntax:
addArray(float*, float*, float*):
xor ecx, ecx
loop:
vmovaps ymm0, YMMWORD PTR [rsi + rcx]
vaddps ymm0, ymm0, YMMWORD PTR [rdx + rcx]
vmovaps YMMWORD PTR [rdi + rcx], ymm0
add ecx, 32
cmp ecx, 512
jb loop
vzeroupper
ret
In the System V 64-bit calling convention, the first three parameters are passed in the rdi, rsi, and rdx registers, so the code doesn't need to move the parameters into registers—they are already there.
But you are not using input/output constraints to their fullest. You don't need rcx to be used as the counter. Nor do you need to use ymm0 as the scratch register. If you let the compiler pick which free registers to use, it will make the code more efficient. You also won't need to provide an explicit clobber list:
#include <stdint.h>
#include <x86intrin.h>
void addArray(float *a, float *b, float *c) {
uint64_t temp = 0;
__m256 ymm;
__asm__ __volatile__(
"loop: \n\t"
"vmovaps (%3,%0), %1 \n\t" // Load 8 elements from b
"vaddps (%4,%0), %1, %1 \n\t" // Add 8 elements from c
"vmovaps %1, (%2,%0) \n\t" // Store result in a
"addl $0x20, %0 \n\t" // 8 elemtns * 4 bytes = 32 (0x20)
"cmpl $0x200, %0 \n\t" // 128 elements * 4 bytes = 512 (0x200)
"jb loop" // Loop
: "+r" (temp), "=x" (ymm)
: "r" (a), "r" (b), "r" (c)
: "memory"
);
}
Of course, as has been mentioned in the comments, this entire exercise is a waste of time. GAS-style inline assembly, although powerful, is exceedingly difficult to write correctly (I'm not even 100% positive that my code here is correct!), so you should not write anything using inline assembly that you absolutely don't have to. And this is certainly not a case where you have to—the compiler will optimize the addition loop automatically:
void addArray(float *a, float *b, float *c) {
for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
}
With -O2 and -mavx2, GCC compiles this to the following:
addArray(float*, float*, float*):
xor eax, eax
.L2:
vmovss xmm0, DWORD PTR [rsi+rax]
vaddss xmm0, xmm0, DWORD PTR [rdx+rax]
vmovss DWORD PTR [rdi+rax], xmm0
add rax, 4
cmp rax, 512
jne .L2
rep ret
Well, that looks awfully familiar, doesn't it? To be fair, it isn't vectorized like your code is. You can get that by using -O3 or -ftree-vectorize, but you also get a lot more code generated, so I'd need a benchmark to convince me that it was actually faster and worth the explosion in code size. But most of this is to handle cases where the input isn't aligned—if you indicate that it is aligned and that the pointer is restricted, that solves these problems and improves the code generation substantially. Notice that it is completely unrolling the loop, as well as vectorizing the addition.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With