x86-64: do address-calculating mov instructions, i.e. mov i(r, r, i), r, execute on port 1? Or is it still p0156?

Question

I am asking whether mov instructions that need to compute an address, i.e. (in AT&T syntax)

mov i(r, r, i), reg or mov reg, i(r, reg, i)

have to be executed on port 1 because they are effectively an LEA with 3 operands + MOV, or whether they are free to be executed on p0156.

If they do indeed execute the LEA portion on port 1, will port 1 be unblocked once the address computation is complete, or will the entire memory load need to complete first?

On ICL it seems p7 can do indexed address mode?

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>


#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))


#define TERMS 3

void BENCH_ATTR
test_store_port() {
    const uint32_t N = (1 << 29);

    uint64_t dst, loop_cnt;
    uint64_t src[16] __attribute__((aligned(64)));

    asm volatile(
        "movl %[N], %k[loop_cnt]\n\t"
        ".p2align 5\n\t"
        "1:\n\t"

        "movl %k[loop_cnt], %k[dst]\n\t"
        "andl $15, %k[dst]\n\t"
#if TERMS == 3
        "movl %k[dst], (%[src], %[dst], 4)\n\t"
#else
        "movl %k[dst], (%[src])\n\t"
#endif

        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ dst ] "+r"(dst), [ loop_cnt ] "+r"(loop_cnt)
        : [ N ] "i"(N), [ src ] "r"(src), "m"(*((const uint32_t(*)[16])src))
        : "cc");
}

int
main(int argc, char ** argv) {
    test_store_port();
}

Results with #define TERMS 3:

perf stat -e uops_dispatched.port_2_3 -e uops_dispatched.port_7_8 -e uops_issued.any -e cpu-cycles ./bsf_dep

 Performance counter stats for './bsf_dep':

           297,191      uops_dispatched.port_2_3                                    
       537,039,830      uops_dispatched.port_7_8                                    
     2,149,098,661      uops_issued.any                                             
       761,661,276      cpu-cycles                                                  

       0.210463841 seconds time elapsed

       0.210366000 seconds user
       0.000000000 seconds sys

Results with #define TERMS 1:

perf stat -e uops_dispatched.port_2_3 -e uops_dispatched.port_7_8 -e uops_issued.any -e cpu-cycles ./bsf_dep

 Performance counter stats for './bsf_dep':

           291,370      uops_dispatched.port_2_3                                    
       537,040,822      uops_dispatched.port_7_8                                    
     2,148,947,408      uops_issued.any                                             
       761,476,510      cpu-cycles                                                  

       0.202235307 seconds time elapsed

       0.202209000 seconds user
       0.000000000 seconds sys

Peter Cordes · Accepted Answer

All CPUs do address-generation for load / store uops on AGUs in the load or store-address ports, not on the ALU ports. Only LEA ever uses the ALU execution ports for that shift-and-add math.

If complex addressing modes needed port 1, https://uops.info/ and/or https://agner.org/optimize/ would say so in their instruction tables. But they don't: loads only need p23, and stores only p237 for store-address + p4 for store-data.

https://www.realworldtech.com/haswell-cpu/5/ shows the load/AGU execution units on ports 2 and 3, and the store-AGU on port 7. That exact layout applies to Haswell through Skylake.

Actually, indexed stores can only use p23 for the store-address uop: the simple store-address AGU on port 7 (Haswell through Skylake) can only handle reg+constant addressing. That means address generation can be a bottleneck if you use indexed addressing modes in code that could otherwise sustain 2 loads + 1 store per clock.
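To make the two store forms concrete, here is a minimal sketch in GNU C with inline asm (x86-64 only, with a plain-C fallback elsewhere). The helper names store_indexed, store_simple, fill_indexed and fill_pointer are made up for illustration and are not from the question. It only shows which addressing mode each loop uses; it does not measure anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store val to base[idx] with an indexed addressing mode: (base, idx, 4). */
static void store_indexed(uint32_t *base, uint64_t idx, uint32_t val) {
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ __volatile__("movl %[val], (%[base], %[idx], 4)"
                         : "=m"(base[idx])
                         : [val] "r"(val), [base] "r"(base), [idx] "r"(idx));
#else
    base[idx] = val; /* fallback: the compiler chooses the addressing mode */
#endif
}

/* Store val to *p with a simple base-register addressing mode: (p). */
static void store_simple(uint32_t *p, uint32_t val) {
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ __volatile__("movl %[val], (%[p])"
                         : "=m"(*p)
                         : [val] "r"(val), [p] "r"(p));
#else
    *p = val; /* fallback: the compiler chooses the addressing mode */
#endif
}

/* Write dst[i] = i using the indexed store form. */
static void fill_indexed(uint32_t *dst, size_t n) {
    assert(dst != NULL);
    for (size_t i = 0; i < n; i++)
        store_indexed(dst, i, (uint32_t)i);
}

/* Write dst[i] = i by bumping the pointer, so every store is (reg) only. */
static void fill_pointer(uint32_t *dst, size_t n) {
    assert(dst != NULL);
    uint32_t *end = dst + n;
    for (uint32_t i = 0; dst != end; dst++, i++)
        store_simple(dst, i);
}
```

Both loops leave the same contents in memory; only the addressing mode of the store differs. Per the paragraph above, on Haswell through Skylake only the pointer-increment form has a store-address uop that is eligible for port 7.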

(Early Sandybridge-family, SnB and IvB, would even un-laminate indexed stores, so there was a front-end cost, too.)

Ice Lake changed that, with 2 dedicated store AGUs on ports 7 and 8. Store-address uops can't borrow load AGUs anymore, so the store AGUs have to be full featured. https://uops.info/html-tp/ICL/MOV_M32_R32-Measurements.html confirms that stores with indexed addressing mode do run at 2/clock on ICL, so both of the store AGUs are full-featured. e.g. mov [r14+r13*1+0x4],r8d. (uops.info didn't test a scale factor > 1, but I'd assume both the store-AGUs are identical in which case they'd both handle it.)
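As a rough cross-check, the perf counters quoted in the question for the TERMS == 3 run can be normalized per loop iteration (N = 1 << 29). The sketch below does only that arithmetic; per_iteration and print_terms3_breakdown are hypothetical helper names, and the interpretation in the note after the code is a reading of the numbers, not an additional measurement.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Events per loop iteration, given a raw counter value and the iteration count. */
static double per_iteration(uint64_t count, uint64_t iterations) {
    assert(iterations != 0);
    return (double)count / (double)iterations;
}

/* Print the per-iteration breakdown of the TERMS == 3 counters from the question. */
static void print_terms3_breakdown(void) {
    const uint64_t iters = UINT64_C(1) << 29; /* N in test_store_port() */

    printf("p2/p3 uops per iter:  %.4f\n",
           per_iteration(UINT64_C(297191), iters));
    printf("p7/p8 uops per iter:  %.4f\n",
           per_iteration(UINT64_C(537039830), iters));
    printf("issued uops per iter: %.4f\n",
           per_iteration(UINT64_C(2149098661), iters));
    printf("cycles per iter:      %.4f\n",
           per_iteration(UINT64_C(761661276), iters));
}
```

The ratios come out to about 1 uop per iteration on ports 7/8, almost nothing on ports 2/3, about 4 issued uops and about 1.4 cycles per iteration. That is consistent with the indexed store's store-address uop running on the dedicated store AGUs, and with the loop being 4 fused-domain uops (mov, and, a micro-fused store, and a macro-fused dec/jnz).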

Unfortunately, it will be many years before HSW/SKL stop being important for tuning: Intel is still selling Skylake-derived microarchitectures, so for desktop software they'll be a large part of the installed base for years.
