 

Copying 64 bytes of memory with NT stores to one full cache line vs. 2 consecutive partial cache lines

I'm reading the Intel Optimization Manual's section on Write Combining memory and wrote benchmarks to understand how it works. These are the two functions I'm benchmarking:

memcopy.h:

void avx_ntcopy_cache_line(void *dest, const void *src);

void avx_ntcopy_64_two_cache_lines(void *dest, const void *src);

memcopy.S:

avx_ntcopy_cache_line:
    ;SysV ABI: rdi = dest, rsi = src
    vmovdqa ymm0, [rsi]
    vmovdqa ymm1, [rsi + 0x20]
    vmovntdq [rdi], ymm0
    vmovntdq [rdi + 0x20], ymm1
    ;intentionally no sfence after nt-store
    ret

avx_ntcopy_64_two_cache_lines:
    vmovdqa ymm0, [rsi]
    vmovdqa ymm1, [rsi + 0x40]
    vmovntdq [rdi], ymm0
    vmovntdq [rdi + 0x40], ymm1
    ;intentionally no sfence after nt-store
    ret
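
For readers who prefer intrinsics, here is a sketch of equivalent C (the _c-suffixed names are mine, and it needs AVX enabled at compile time, e.g. -mavx):

#include <immintrin.h>

/* One full cache line in dest: two 32-byte NT stores that together
 * write the whole 64-byte line. */
void avx_ntcopy_cache_line_c(void *dest, const void *src){
    __m256i lo = _mm256_load_si256((const __m256i *)src);
    __m256i hi = _mm256_load_si256((const __m256i *)((const char *)src + 0x20));
    _mm256_stream_si256((__m256i *)dest, lo);
    _mm256_stream_si256((__m256i *)((char *)dest + 0x20), hi);
    /* intentionally no _mm_sfence() after the NT stores */
}

/* Two partial cache lines: one 32-byte NT store into each of two
 * consecutive lines, so neither line is ever fully written. */
void avx_ntcopy_64_two_cache_lines_c(void *dest, const void *src){
    __m256i a = _mm256_load_si256((const __m256i *)src);
    __m256i b = _mm256_load_si256((const __m256i *)((const char *)src + 0x40));
    _mm256_stream_si256((__m256i *)dest, a);
    _mm256_stream_si256((__m256i *)((char *)dest + 0x40), b);
    /* intentionally no _mm_sfence() after the NT stores */
}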

Here is what the benchmark's main function looks like:

#include <stdlib.h>
#include <inttypes.h>
#include <x86intrin.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include "memcopy.h"

#define ITERATIONS 1000000

//As @HadiBrais noted, there might be an issue with 4K aliasing
_Alignas(64) char src[128];
_Alignas(64) char dest[128];

static void run_benchmark(unsigned runs, unsigned run_iterations,
                    void (*fn)(void *, const void*), void *dest, const void* src);

int main(void){
    int fd = open("/dev/urandom", O_RDONLY);
    read(fd, src, sizeof src);

    run_benchmark(20, ITERATIONS, avx_ntcopy_cache_line, dest, src);
    run_benchmark(20, ITERATIONS, avx_ntcopy_64_two_cache_lines, dest, src);
}

static int uint64_compare(const void *u1, const void *u2){
    uint64_t uint1 = *(uint64_t *) u1;
    uint64_t uint2 = *(uint64_t *) u2;
    if(uint1 < uint2){
        return -1;
    } else if (uint1 == uint2){
        return 0;
    } else {
        return 1;
    }
}

static inline uint64_t benchmark_2cache_lines_copy_function(unsigned iterations, void (*fn)(void *, const void *),
                                               void *restrict dest, const void *restrict src){
    uint64_t *results = malloc(iterations * sizeof(uint64_t));
    unsigned idx = iterations;
    while(idx --> 0){
        uint64_t start = __rdpmc((1<<30)+1);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        uint64_t finish = __rdpmc((1<<30)+1);
        results[idx] = (finish - start) >> 4;
    }
    qsort(results, iterations, sizeof *results, uint64_compare);
    //median; free the results buffer before returning
    uint64_t median = results[iterations >> 1];
    free(results);
    return median;
}

static void run_benchmark(unsigned runs, unsigned run_iterations,
                    void (*fn)(void *, const void*), void *dest, const void* src){
    unsigned current_run = 1;
    while(current_run <= runs){
        uint64_t time = benchmark_2cache_lines_copy_function(run_iterations, fn, dest, src);
        printf("Run %d result: %lu\n", current_run, time);
        current_run++;
    }
}
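
A note on the timing primitive: __rdpmc((1<<30)+1) reads fixed-function counter 1, which counts unhalted core cycles; bit 30 of the rdpmc index selects the fixed-function counters. A self-documenting wrapper might look like this (the names are mine, not from the original code; on Linux, user-space rdpmc also has to be enabled, e.g. via /sys/bus/event_source/devices/cpu/rdpmc):

#include <stdint.h>
#include <x86intrin.h>

#define RDPMC_FIXED      (1u << 30)  /* bit 30 selects the fixed-function counters */
#define FIXED_CTR_CYCLES 1           /* fixed counter 1 = CPU_CLK_UNHALTED.CORE */

static inline uint64_t read_core_cycles(void){
    return __rdpmc(RDPMC_FIXED | FIXED_CTR_CYCLES);
}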

Compiling with the options:

-Werror \
-Wextra \
-Wall \
-pedantic \
-Wno-stack-protector \
-g3 \
-O3 \
-Wno-unused-result \
-Wno-unused-parameter

Running the benchmarks, I got the following results:

I. avx_ntcopy_cache_line:

Run 1 result: 61
Run 2 result: 61
Run 3 result: 61
Run 4 result: 61
Run 5 result: 61
Run 6 result: 61
Run 7 result: 61
Run 8 result: 61
Run 9 result: 61
Run 10 result: 61
Run 11 result: 61
Run 12 result: 61
Run 13 result: 61
Run 14 result: 61
Run 15 result: 61
Run 16 result: 61
Run 17 result: 61
Run 18 result: 61
Run 19 result: 61
Run 20 result: 61

perf:

 Performance counter stats for './bin':

     3 503 775 289      L1-dcache-loads                                               (18,87%)
        91 965 805      L1-dcache-load-misses     #    2,62% of all L1-dcache hits    (18,94%)
     2 041 496 256      L1-dcache-stores                                              (19,01%)
         5 461 440      LLC-loads                                                     (19,08%)
         1 108 179      LLC-load-misses           #   20,29% of all LL-cache hits     (19,10%)
        18 028 817      LLC-stores                                                    (9,55%)
       116 865 915      l2_rqsts.all_pf                                               (14,32%)
                 0      sw_prefetch_access.t1_t2                                      (19,10%)
           666 096      l2_lines_out.useless_hwpf                                     (19,10%)
        47 701 696      l2_rqsts.pf_hit                                               (19,10%)
        62 556 656      l2_rqsts.pf_miss                                              (19,10%)
         4 568 231      load_hit_pre.sw_pf                                            (19,10%)
        17 113 190      l2_rqsts.rfo_hit                                              (19,10%)
        15 248 685      l2_rqsts.rfo_miss                                             (19,10%)
        54 460 370      LD_BLOCKS_PARTIAL.ADDRESS_ALIAS                                     (19,10%)
    18 469 040 693      uops_retired.stall_cycles                                     (19,10%)
    16 796 868 661      uops_executed.stall_cycles                                     (19,10%)
    18 315 632 129      uops_issued.stall_cycles                                      (19,05%)
    16 176 115 539      resource_stalls.sb                                            (18,98%)
    16 424 440 816      resource_stalls.any                                           (18,92%)
    22 692 338 882      cycles                                                        (18,85%)

       5,780512545 seconds time elapsed

       5,740239000 seconds user
       0,040001000 seconds sys

II. avx_ntcopy_64_two_cache_lines:

Run 1 result: 6
Run 2 result: 6
Run 3 result: 6
Run 4 result: 6
Run 5 result: 6
Run 6 result: 6
Run 7 result: 6
Run 8 result: 6
Run 9 result: 6
Run 10 result: 6
Run 11 result: 6
Run 12 result: 6
Run 13 result: 6
Run 14 result: 6
Run 15 result: 6
Run 16 result: 6
Run 17 result: 6
Run 18 result: 6
Run 19 result: 6
Run 20 result: 6

perf:

 Performance counter stats for './bin':

     3 095 792 486      L1-dcache-loads                                               (19,26%)
        82 194 718      L1-dcache-load-misses     #    2,66% of all L1-dcache hits    (18,99%)
     1 793 291 250      L1-dcache-stores                                              (19,00%)
         4 612 503      LLC-loads                                                     (19,01%)
           975 438      LLC-load-misses           #   21,15% of all LL-cache hits     (18,94%)
        15 707 916      LLC-stores                                                    (9,47%)
        97 928 734      l2_rqsts.all_pf                                               (14,20%)
                 0      sw_prefetch_access.t1_t2                                      (19,21%)
           532 203      l2_lines_out.useless_hwpf                                     (19,19%)
        35 394 752      l2_rqsts.pf_hit                                               (19,20%)
        56 303 030      l2_rqsts.pf_miss                                              (19,20%)
         6 197 253      load_hit_pre.sw_pf                                            (18,93%)
        13 458 517      l2_rqsts.rfo_hit                                              (18,94%)
        14 031 767      l2_rqsts.rfo_miss                                             (18,93%)
        36 406 273      LD_BLOCKS_PARTIAL.ADDRESS_ALIAS                                     (18,94%)
     2 213 339 719      uops_retired.stall_cycles                                     (18,93%)
     1 225 185 268      uops_executed.stall_cycles                                     (18,94%)
     1 943 649 682      uops_issued.stall_cycles                                      (18,94%)
       126 401 004      resource_stalls.sb                                            (19,20%)
       202 537 285      resource_stalls.any                                           (19,20%)
     5 676 443 982      cycles                                                        (19,18%)

       1,521271014 seconds time elapsed

       1,483660000 seconds user
       0,032253000 seconds sys

As can be seen, there is a 10x difference in the measured results.


My Interpretation:

As explained in Intel Optimization Manual/3.6.9:

writes to different parts of the same cache line can be grouped into a single, full-cache-line bus transaction instead of going across the bus (since they are not cached) as several partial writes

I assumed that in the case of avx_ntcopy_cache_line, the full 64-byte write initiates a bus transaction to write the data out, and that this prohibits rdtsc from being executed out of order.

By contrast, in the case of avx_ntcopy_64_two_cache_lines, the two 32-byte writes go into WC buffers for different cache lines and no bus transaction is triggered. This allowed rdtsc to be executed out of order.

This interpretation looks extremely suspicious, though, and it does not square with the difference in bus-cycles counts:

avx_ntcopy_cache_line: 131 454 700

avx_ntcopy_64_two_cache_lines: 31 957 050

QUESTION: What is the true cause of such difference in measurement?

asked Jan 18 '20 by St.Antario

1 Answer

Hypothesis: a (fully) overlapping store to a not-yet-flushed WC buffer can just merge into it. Completing a line triggers an immediate flush, and all those stores going all the way off core is slow.

You report ~100x more resource_stalls.sb for the full-line version than for the two-partial-lines version. That's consistent with this explanation.

If 2_lines can commit the NT stores into existing WC buffers (LFBs), the store buffer can keep up with the rate of store instructions executing, and the benchmark usually bottlenecks on something else. (Probably just the front-end, given the call/ret overhead for each pair of loads/stores, although of course call itself includes a store.) Your perf results show ~1.8 billion stores (to L1) over ~5.7 billion cycles, i.e. about 0.32 stores per cycle, well within the 1 store/cycle limit we might expect for stores hitting in the WC buffer.

But if WC buffers get flushed, which happens when a line is fully written, it has to go off core (which is slow), tying up that LFB for a while so it can't be used to commit later NT stores. When stores can't leave the store buffer, it fills up and the core stalls on being able to allocate resources for new store instructions to enter the back-end. (Specifically the issue/rename/allocate stage stalls.)
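
One way to probe this hypothesis would be to force the flush in both variants with an explicit sfence after the stores: if completing a line (and the resulting off-core write) is what's expensive, draining the WC buffers on every call should bring the two-partial-lines version much closer to the full-line one. A sketch (the function name is mine, not from the original post):

#include <immintrin.h>

/* Hypothetical probe: same stores as avx_ntcopy_64_two_cache_lines,
 * plus an sfence that forces the partially-filled WC buffers to be
 * flushed (as partial-line writes) before the function returns. */
void avx_ntcopy_two_lines_flushed(void *dest, const void *src){
    __m256i a = _mm256_load_si256((const __m256i *)src);
    __m256i b = _mm256_load_si256((const __m256i *)((const char *)src + 0x40));
    _mm256_stream_si256((__m256i *)dest, a);
    _mm256_stream_si256((__m256i *)((char *)dest + 0x40), b);
    _mm_sfence();  /* earlier stores must become globally visible: WC buffers drain */
}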

You could probably see this effect more clearly with any of the L2, L3, SQ, or offcore req/resp events that would pick up all this traffic outside of the L1. You include some L2 counters, but those probably don't pick up NT stores that pass through L2.


Enhanced REP MOVSB for memcpy suggests that NT stores take longer for the LFB to "hand off" to outer levels of the memory hierarchy, keeping the LFB occupied long after the request starts its journey. (Perhaps to make sure a core can always reload what it just stored, or otherwise to avoid losing track of an in-flight NT store, to maintain coherency with MESI.) A later sfence also needs to know when earlier NT stores have become visible to other cores, so we can't have them invisible at any point before that.

Even if that's not the case, there's still going to be a throughput bottleneck somewhere for all those NT store requests. So the other possible mechanism is that they fill up some buffer and then the core can't hand off LFBs anymore, so it runs out of LFBs to commit NT stores into, and then the SB fills stalling allocation.

They might merge once they get to the memory controller without each one needing a burst transfer over the actual external memory bus, but the path from a core through the uncore to a memory controller is not short.


Even doing 2x rdpmc for every 32 stores doesn't slow the CPU down enough to prevent the store buffer from filling; what you're seeing depends on running this in a relatively tight loop, not a one-shot execution with an empty store buffer to start with. Also, your suggestion that rdpmc or rdtsc won't be reordered wrt. the WC buffers flushing makes zero sense. Execution of stores isn't ordered wrt. execution of rdtsc.
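
If you did want a counter read that is ordered with respect to the surrounding instructions' execution, you would have to fence it explicitly; a sketch (hypothetical helper, not part of the original benchmark). Note that lfence orders execution but still does not wait for NT stores to drain out of the WC buffers; only sfence/mfence order store visibility:

#include <stdint.h>
#include <x86intrin.h>

/* lfence blocks later instructions from executing until earlier ones
 * have completed locally, so the counter read can't move around the
 * timed region.  It does NOT flush WC buffers. */
static inline uint64_t rdpmc_fenced(int counter){
    _mm_lfence();
    uint64_t value = __rdpmc(counter);
    _mm_lfence();
    return value;
}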

TL:DR: your rdpmc to time an individual group of stores isn't helpful, and if anything hides some of the perf difference by slowing down the fast case that doesn't bottleneck on the store buffer.
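
A measurement structure less sensitive to that problem is to time one long run of back-to-back calls and report the average, so steady-state store-buffer behavior dominates; a sketch (again mine, not the original harness):

#include <stdint.h>
#include <x86intrin.h>

/* Time n back-to-back calls once and average, instead of bracketing
 * each small group of calls with counter reads. */
static uint64_t cycles_per_call(unsigned long n, void (*fn)(void *, const void *),
                                void *dest, const void *src){
    uint64_t start = __rdpmc((1 << 30) + 1);
    for(unsigned long i = 0; i < n; i++)
        fn(dest, src);
    return (__rdpmc((1 << 30) + 1) - start) / n;
}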

answered Sep 28 '22 by Peter Cordes