
Adding arrays using YMM instructions with gcc

I want to translate the following code (Intel syntax) into gcc inline assembly (AT&T syntax).

; float a[128], b[128], c[128];
; for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
; Assume that a, b and c are aligned by 32
      xor ecx, ecx              ; Loop counter i = 0
L:    vmovaps ymm0, [b+rcx]     ; Load 8 elements from b
      vaddps ymm0,ymm0,[c+rcx]  ; Add  8 elements from c    
      vmovaps [a+rcx], ymm0     ; Store result in a
      add ecx,32                ; 8 elements * 4 bytes = 32 
      cmp ecx, 512              ; 128 elements * 4 bytes = 512 
      jb L                      ;Loop

The code is from Agner Fog's Optimizing subroutines in assembly language.

The code I've written so far is:

static inline void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
    "nop                            \n"
    "xor        %%ecx, %%ecx        \n" //;Loop counter set to 0
    "loop: \n\t"
    "vmovaps    %1, %%ymm0          \n" //;Load 8 elements from b  <== WRONG
    "vaddps     %2, %%ymm0, %%ymm0  \n" //;Add  8 elements from c  <== WRONG
    "vmovaps    %%ymm0, %0          \n" //;Store result in a
    "add        0x20, %%ecx         \n" //;8 elements * 4 bytes = 32 (0x20)
    "cmp        0x200,%%ecx         \n" //;128 elements * 4 bytes = 512 (0x200)
    "jb         loop                \n" //;Loop
    "nop                            \n"
        : "=m"(a)                       //Outputs
        : "m"(b), "m"(c)                //Inputs
        : "%ecx","%ymm0"                //Modifies ECX and YMM0
    );
}

The lines marked as "wrong" generate: (excerpt from gdb's disassemble output)

0x0000000000000b78 <+19>:   vmovaps -0x10(%rbp),%ymm0
0x0000000000000b7d <+24>:   vaddps -0x18(%rbp),%ymm0,%ymm0

I want to get something like this (I guess):

vmovaps -0x10(%rbp,%ecx,%0x8),%ymm0

But I do not know how to specify %ecx as my index register.

Can you help me, please?

EDIT

I've tried (%1, %%rcx):

__asm__ __volatile__ (
    "nop                            \n"
    "xor        %%ecx, %%ecx        \n" //;Loop counter set to 0
    "loop: \n\t"
    "vmovaps    (%1, %%rcx), %%ymm0 \n" //;Load 8 elements from b  <== MODIFIED HERE
    "vaddps     %2, %%ymm0, %%ymm0  \n" //;Add  8 elements from c
    "vmovaps    %%ymm0, %0          \n" //;Store result in a
    "add        0x20, %%ecx         \n" //;8 elements * 4 bytes = 32 (0x20)
    "cmp        0x200,%%ecx         \n" //;128 elements * 4 bytes = 512 (0x200)
    "jb         loop                \n" //;Loop
    "nop                            \n"
        : "=m"(a)                       //Outputs
        : "m"(b), "m"(c)                //Inputs
        : "%ecx","%ymm0"                //Modifies ECX and YMM0
    );

And I got:

inline1.cpp: Assembler messages:
inline1.cpp:90: Error: found '(', expected: ')'
inline1.cpp:90: Error: junk `(%rbp),%rcx)' after expression
Chocksmith asked Apr 07 '26

1 Answer

I don't think it is possible to translate this literally into GAS inline assembly. In AT&T syntax, a memory operand has the form:

displacement(base register, index register, scale)

which computes the address base + index*scale + displacement and would produce something akin to:

movl  -4(%ebp, %ecx, 4), %eax

or in your case:

vmovaps  -16(%rsp, %rcx), %ymm0

(Note that in 64-bit addressing the index register must be the 64-bit %rcx, not %ecx, and since your counter advances in bytes, the scale is 1 and can be omitted.)
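To make the addressing mode concrete, here is a small sketch of the same computation in plain C (hypothetical array and values, not from the question); the helper just spells out base + index*scale + disp as byte arithmetic:

```c
#include <string.h>

// disp(base, index, scale) computes the byte address
// base + index*scale + disp. With scale = 4 = sizeof(float), each
// index increment advances one element, so load_at(b, i, 4, 0) reads
// the same address an instruction like "vmovss (%rsi,%rcx,4), %xmm0"
// would, with rsi = b and rcx = i.
static float load_at(const float *base, long index, long scale, long disp) {
    float x;
    memcpy(&x, (const char *)base + index * scale + disp, sizeof x);
    return x;
}
```

For example, with float b[8] = {10, ..., 17}, load_at(b, 3, 4, 0) yields b[3], and load_at(b, 1, 8, 4) reaches the same element via 1*8 + 4 = 12 bytes.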

The problem is, when you use a memory constraint (m), the inline assembler is going to emit the following wherever you write %n (where n is the number of the input/output):

-16(%rsp)

There is no way to manipulate the above into the form you actually want. You can write:

(%1, %%rcx)

but this will produce:

(-16(%rsp),%rcx)

which is clearly wrong. There is no way to get the offset register inside of those parentheses, where it belongs, since %n is emitting the whole -16(%rsp) as a chunk.

Of course, this is not really an issue: you write inline assembly for speed, and there is nothing speedy about going through memory for the pointers. You should have the inputs in registers, and when you use a register constraint (r) for the inputs/outputs, the problem disappears. Notice that this will require modifying your code slightly.

Other things wrong with your inline assembly include:

  1. Immediate operands begin with $; without it, 0x20 is parsed as a memory address (which is why your add and cmp would load from memory).
  2. Instructions should have size suffixes, like l for 32-bit and q for 64-bit.
  3. You are clobbering memory when you write through a, so you should have a memory clobber.
  4. The nop instructions at the beginning and the end are completely pointless. They aren't even aligning the branch target.
  5. Every line should really end with a tab character (\t), in addition to a new-line (\n), so that you get proper alignment when you inspect the disassembly.

Here is my version of the code:

void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
    "xorl       %%ecx, %%ecx                \n\t" // Loop counter set to 0
    "loop:                                  \n\t"
    "vmovaps    (%1,%%rcx), %%ymm0          \n\t" // Load 8 elements from b
    "vaddps     (%2,%%rcx), %%ymm0, %%ymm0  \n\t" // Add  8 elements from c
    "vmovaps    %%ymm0, (%0,%%rcx)          \n\t" // Store result in a
    "addl       $0x20,  %%ecx               \n\t" // 8 elements * 4 bytes = 32 (0x20)
    "cmpl       $0x200, %%ecx               \n\t" // 128 elements * 4 bytes = 512 (0x200)
    "jb         loop"                             // Loop
        :                                         // Outputs
        : "r" (a), "r" (b), "r" (c)               // Inputs
        : "%ecx", "%ymm0", "memory"               // Modifies ECX, YMM0, and memory
    );
}

This causes the compiler to emit the following:

addArray(float*, float*, float*):
        xorl       %ecx, %ecx                
     loop:                                  
        vmovaps    (%rsi,%rcx), %ymm0           # b
        vaddps     (%rdx,%rcx), %ymm0, %ymm0    # c
        vmovaps    %ymm0, (%rdi,%rcx)           # a
        addl       $0x20,  %ecx               
        cmpl       $0x200, %ecx               
        jb         loop
        vzeroupper
        retq

Or, in the more familiar Intel syntax:

addArray(float*, float*, float*):
        xor     ecx, ecx
   loop:
        vmovaps ymm0, YMMWORD PTR [rsi + rcx]
        vaddps  ymm0, ymm0, YMMWORD PTR [rdx + rcx]
        vmovaps YMMWORD PTR [rdi + rcx], ymm0
        add     ecx, 32
        cmp     ecx, 512
        jb      loop
        vzeroupper
        ret

In the System V 64-bit calling convention, the first three parameters are passed in the rdi, rsi, and rdx registers, so the code doesn't need to move the parameters into registers—they are already there.

But you are not using input/output constraints to their fullest. You don't need rcx to be used as the counter. Nor do you need to use ymm0 as the scratch register. If you let the compiler pick which free registers to use, it will make the code more efficient. You also won't need to provide an explicit clobber list:

#include <stdint.h>
#include <x86intrin.h>

void addArray(float *a, float *b, float *c) {
  uint64_t temp = 0;
  __m256   ymm;
  __asm__ __volatile__(
      "loop:                        \n\t"
      "vmovaps    (%3,%0), %1       \n\t" // Load 8 elements from b
      "vaddps     (%4,%0), %1, %1   \n\t" // Add  8 elements from c
      "vmovaps    %1, (%2,%0)       \n\t" // Store result in a
      "addl       $0x20,  %0        \n\t" // 8 elements * 4 bytes = 32 (0x20)
      "cmpl       $0x200, %0        \n\t" // 128 elements * 4 bytes = 512 (0x200)
      "jb         loop"                   // Loop
    : "+r" (temp), "=x" (ymm)
    : "r" (a), "r" (b), "r" (c)
    : "memory"
    );
}
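One pitfall worth flagging as an aside (not raised in the original answer): the named label loop is emitted verbatim, so if the compiler emits the asm template more than once (for example, after inlining the function into several call sites), the assembler rejects the duplicate symbol. GAS numbered local labels ("1:", jumped to with 1b/1f) sidestep this. A scalar sketch of the pattern, using a hypothetical sum_n helper (x86-64 GCC/Clang only):

```c
// Sums the integers 0 .. n-1 using a GAS numbered local label instead of
// a named one. "1:" defines the label; "jb 1b" jumps backward to the
// nearest preceding "1:", so the template can be emitted many times.
static int sum_n(int n) {
    int total = 0, i = 0;
    __asm__ __volatile__(
        "1:                       \n\t"  // local label: safe to duplicate
        "addl   %[i], %[total]    \n\t"  // total += i
        "addl   $1, %[i]          \n\t"  // i++
        "cmpl   %[n], %[i]        \n\t"
        "jb     1b"                      // loop while i < n (unsigned)
        : [total] "+r" (total), [i] "+r" (i)
        : [n] "r" (n)
        : "cc");
    return total;
}
```

The same change (loop: becoming 1:, and jb loop becoming jb 1b) applies directly to the AVX versions above.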

Of course, as has been mentioned in the comments, this entire exercise is a waste of time. GAS-style inline assembly, although powerful, is exceedingly difficult to write correctly (I'm not even 100% positive that my code here is correct!), so you should not write anything using inline assembly that you absolutely don't have to. And this is certainly not a case where you have to—the compiler will optimize the addition loop automatically:

void addArray(float *a, float *b, float *c) {
  for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
}

With -O2 and -mavx2, GCC compiles this to the following:

addArray(float*, float*, float*):
        xor     eax, eax
   .L2:
        vmovss  xmm0, DWORD PTR [rsi+rax]
        vaddss  xmm0, xmm0, DWORD PTR [rdx+rax]
        vmovss  DWORD PTR [rdi+rax], xmm0
        add     rax, 4
        cmp     rax, 512
        jne     .L2
        rep ret

Well, that looks awfully familiar, doesn't it? To be fair, it isn't vectorized like your code is. You can get that with -O3 or -ftree-vectorize, but you also get a lot more code generated, so I'd need a benchmark to convince me it was actually faster and worth the explosion in code size. Most of that extra code exists to handle inputs that might be unaligned or might overlap; if you tell the compiler that the pointers are aligned and restrict-qualified, those checks disappear and the code generation improves substantially. Notice that it then completely unrolls the loop, as well as vectorizing the addition.

Cody Gray answered Apr 08 '26