Whiskey Lake i7-8565U
I'm trying to learn how to write single-shot benchmarks by hand (without using any benchmarking framework), using a memory-copy routine with regular and non-temporal writes to WB memory as the example, and I would like to ask for a review.
Declaration:
void *avx_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
void *avx_nt_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
Definition:
avx_memcpy_forward_llss:
        shr     rdx, 0x3
        xor     rcx, rcx
avx_memcpy_forward_loop_llss:
        vmovdqa ymm0, [rsi + 8*rcx]
        vmovdqa ymm1, [rsi + 8*rcx + 0x20]
        vmovdqa [rdi + 8*rcx], ymm0
        vmovdqa [rdi + 8*rcx + 0x20], ymm1
        add     rcx, 0x08
        cmp     rdx, rcx
        ja      avx_memcpy_forward_loop_llss
        ret

avx_nt_memcpy_forward_llss:
        shr     rdx, 0x3
        xor     rcx, rcx
avx_nt_memcpy_forward_loop_llss:
        vmovdqa ymm0, [rsi + 8*rcx]
        vmovdqa ymm1, [rsi + 8*rcx + 0x20]
        vmovntdq [rdi + 8*rcx], ymm0
        vmovntdq [rdi + 8*rcx + 0x20], ymm1
        add     rcx, 0x08
        cmp     rdx, rcx
        ja      avx_nt_memcpy_forward_loop_llss
        ret
Benchmark code:
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <immintrin.h>
#include <x86intrin.h>
#include "memcopy.h"
#define BUF_SIZE (128 * 1024 * 1024) /* parenthesized so the macro is safe in expressions */
_Alignas(64) char src[BUF_SIZE];
_Alignas(64) char dest[BUF_SIZE];
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t));
static inline void cache_flush(char *buf, size_t size);
static inline void generate_data(char *buf, size_t size);
uint64_t run_benchmark(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t)){
generate_data(src, sizeof src);
warmup(wa_iterations, copy_fn); /* use the parameter instead of a hardcoded count */
cache_flush(src, sizeof src);
cache_flush(dest, sizeof dest);
__asm__ __volatile__("mov $0, %%rax\n cpuid":::"rax", "rbx", "rcx", "rdx", "memory");
uint64_t cycles_start = __rdpmc((1 << 30) + 1);
copy_fn(dest, src, sizeof src);
__asm__ __volatile__("lfence" ::: "memory");
uint64_t cycles_end = __rdpmc((1 << 30) + 1);
return cycles_end - cycles_start;
}
int main(void){
uint64_t single_shot_result = run_benchmark(4, avx_memcpy_forward_llss);
printf("Core clock cycles = %" PRIu64 "\n", single_shot_result);
}
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t)){
while(wa_iterations --> 0){
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
}
}
static inline void generate_data(char *buf, size_t sz){
    int fd = open("/dev/urandom", O_RDONLY);
    if(fd < 0) return;
    if(read(fd, buf, sz) < 0) /* best-effort fill; check instead of ignoring */
        perror("read");
    close(fd);
}
static inline void cache_flush(char *buf, size_t sz){
    /* _SC_LEVEL1_DCACHE_LINESIZE is a sysconf() *name*, not the line size
       itself; it has to be passed to sysconf() to obtain the actual value. */
    size_t line = (size_t)sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    for(size_t i = 0; i < sz; i += line){
        _mm_clflush(buf + i);
    }
    _mm_mfence(); /* ensure the flushes are globally visible before timing */
}
Results:
avx_memcpy_forward_llss
median: 44479368 core cycles
UPD: time
real 0m0,217s
user 0m0,093s
sys 0m0,124s
avx_nt_memcpy_forward_llss
median: 24053086 core cycles
UPD: time
real 0m0,184s
user 0m0,056s
sys 0m0,128s
UPD: The results were obtained by running the benchmark with taskset -c 1 ./bin
So I got almost a 2x difference in core cycles between the two memory-copy implementations. I interpret this as RFO requests competing for bus bandwidth in the case of regular stores to WB memory, as specified in IOM/3.6.12 (emphasis mine):
Although the data bandwidth of full 64-byte bus writes due to non-temporal stores is twice that of bus writes to WB memory, transferring 8-byte chunks wastes bus request bandwidth and delivers significantly lower data bandwidth.
QUESTION 1: How can I analyze a benchmark in the single-shot case? Perf counters do not seem useful here due to perf's startup overhead and the overhead of the warmup iterations.
QUESTION 2: Is such a benchmark correct? I added cpuid at the beginning in order to start measuring with clean CPU resources, avoiding stalls due to previous instructions still in flight. I added memory clobbers as a compiler barrier, and lfence to prevent rdpmc from being executed out of order.
Whenever possible, benchmarks should report results in ways that allow as much "sanity-checking" as possible. In this case, a few ways to enable such checks include reporting the elapsed wall-clock time and the implied memory bandwidth alongside the raw cycle counts.
If I assume that your processor was running at its maximum Turbo frequency of 4.6 GHz during the test, then the reported cycle counts correspond to 9.67 milliseconds and 5.23 milliseconds, respectively. Plugging these into a "sanity check" shows that the implied memory bandwidth would exceed what the platform could plausibly deliver.
The failure of these "sanity checks" tells us that the frequency could not have been as high as 4.6 GHz (and was probably no higher than 3.0 GHz), but mostly just points to the need to measure the elapsed time unambiguously....
Your quote from the optimization manual on the inefficiency of streaming stores applies only to cases that cannot be coalesced into full cache line transfers. Your code stores to every element of the output cache lines following "best practice" recommendations (all store instructions writing to the same line are executed consecutively and generating only one stream of stores per loop). It is not possible to completely prevent the hardware from breaking up streaming stores, but in your case it should be extremely rare -- perhaps a few out of a million. Detecting partial streaming stores is a very advanced topic, requiring the use of poorly-documented performance counters in the "uncore" and/or indirect detection of partial streaming stores by looking for elevated DRAM CAS counts (which might be due to other causes). More notes on streaming stores are at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/