How can the instruction rep stosb execute faster than this code?
Clear: mov byte [edi],AL ; Write the value in AL to memory
inc edi ; Bump EDI to next byte in the buffer
dec ecx ; Decrement ECX by one position
jnz Clear ; And loop again until ECX is 0
Is that guaranteed to be true on all modern CPUs? Should I always prefer to use rep stosb instead of writing the loop manually?
In modern CPUs, the microcoded implementation of rep stosb and rep movsb actually uses stores that are wider than 1B, so it can go much faster than one byte per clock.
(Note this only applies to stos and movs, not repe cmpsb or repne scasb. They're still slow, unfortunately: at best about 2 cycles per byte compared on Skylake, which is pathetic vs. AVX2 vpcmpeqb for implementing memcmp or memchr. See https://agner.org/optimize/ for instruction tables, and other perf links in the x86 tag wiki.
See "Why is this code 6.5x slower with optimizations enabled?" for an example of gcc unwisely inlining repnz scasb or a less-bad scalar bithack for a strlen that happens to get large, and a simple SIMD alternative.)
rep stos/movs has significant startup overhead, but ramps up well for large memset/memcpy. (See Intel's and AMD's optimization manuals for discussion of when to use rep stos vs. a vectorized loop for small buffers.) Without the ERMSB feature, though, rep stosb is tuned for medium to small memsets and it's optimal to use rep stosd or rep stosq (if you aren't going to use a SIMD loop).
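As a rough C sketch of the two approaches being compared (assuming GCC or Clang on x86 for the inline asm; the function names are made up for illustration, and a plain loop is used as the fallback on other targets):

```c
#include <assert.h>
#include <stddef.h>

/* Byte-at-a-time fill, the C equivalent of the loop in the question.
   Note: an optimizing compiler may turn this into a memset call or a
   vectorized loop, so inspect the generated asm before benchmarking. */
void fill_byte_loop(unsigned char *dst, unsigned char value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* Fill using rep stosb: EDI/RDI = destination, ECX/RCX = count, AL = value.
   Assumes the direction flag is clear, as the usual calling conventions
   require at function boundaries. */
void fill_rep_stosb(unsigned char *dst, unsigned char value, size_t n)
{
    assert(dst != NULL || n == 0);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)   /* both are modified by the instruction */
                     : "a"(value)
                     : "memory");
#else
    fill_byte_loop(dst, value, n);          /* non-x86 fallback */
#endif
}
```

Both functions produce the same memory contents; only the way the stores are issued differs.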
When single-stepping with a debugger, rep stos only does one iteration (one decrement of ecx/rcx), so the microcode implementation never gets going. Don't let this fool you into thinking that's all it can do.
See "What setup does REP do?" for some details of how Intel P6/SnB-family microarchitectures implement rep movs.
See "Enhanced REP MOVSB for memcpy" for memory-bandwidth considerations with rep movsb vs. an SSE or AVX loop, on Intel CPUs with the ERMSB feature. (Note especially that many-core Xeon CPUs can't saturate DRAM bandwidth with only a single thread, because of limits on how many cache misses are in flight at once, and also RFO vs. non-RFO store protocols.)
A modern Intel CPU should run the asm loop in the question at one iteration per clock, but an AMD Bulldozer-family core probably can't even manage one store per clock. (Bottleneck on the two integer execution ports handling the inc/dec/branch instructions. If the loop condition was a cmp/jcc on edi, an AMD core could macro-fuse the compare-and-branch.)
One major feature of so-called Fast String operations (rep movs and rep stos) on Intel P6 and SnB-family CPUs is that they avoid the read-for-ownership cache coherency traffic when storing to not-previously-cached memory. So it's like using NT stores to write whole cache lines, but still strongly ordered. (The ERMSB feature does use weakly-ordered stores.)
IDK how good AMD's implementation is.
(And a correction: I previously said that Intel SnB can only handle a taken-branch throughput of one per 2 clocks, but in fact it can run tiny loops at one iteration per one clock.)
See the optimization resources (esp. Agner Fog's guides) linked from the x86 tag wiki.
Intel IvyBridge and later also have ERMSB, which lets rep stos[b/w/d/q] and rep movs[b/w/d/q] use weakly-ordered stores (like movnt), allowing the stores to commit to cache out-of-order. This is an advantage if not all of the destination is already hot in L1 cache. I believe, from the wording of the docs, that there's an implicit memory barrier at the end of a fast string op, so any reordering is only visible between stores made by the string op, not between it and other stores. i.e. you still don't need sfence after rep movs.
So for large aligned buffers on Intel IvB and later, a rep stos implementation of memset can beat any other implementation. One that uses movnt stores (which don't leave the data in cache) should also be close to saturating main memory write bandwidth, but may in practice not quite keep up. See comments for discussion of this, but I wasn't able to find any numbers.
For small buffers, different approaches have very different amounts of overhead. Microbenchmarks can make SSE/AVX copy-loops look better than they are, because doing a copy with the same size and alignment every time avoids branch mispredicts in the startup/cleanup code. IIRC, it's recommended to use a vectorized loop for copies under 128B on Intel CPUs (not rep movs). The threshold may be higher than that, depending on the CPU and the surrounding code.
Intel's optimization manual also has some discussion of overhead for different memcpy implementations, and notes that rep movsb has a larger penalty for misalignment than movdqu.
See the code for an optimized memset/memcpy implementation for more info on what is done in practice. (e.g. Agner Fog's library).
If your CPU has the CPUID ERMSB bit, then the rep movsb and rep stosb instructions are executed differently than on older processors.
See Intel Optimization Reference Manual, section 3.7.6 Enhanced REP MOVSB and REP STOSB operation (ERMSB).
Both the manual and my tests show that the benefits of rep stosb compared to generic 32-bit register moves, in 32-bit mode on a Skylake CPU, appear only on large memory blocks, larger than 128 bytes. On smaller blocks, like 5 bytes, the code that you have shown (mov byte [edi],al; inc edi; dec ecx; jnz Clear) would be much faster, since the startup costs of rep stosb are very high, about 35 cycles. However, this speed difference has diminished on the Ice Lake microarchitecture launched in September 2019, which introduced the Fast Short REP MOV (FSRM) feature. This feature can be tested by a CPUID bit. It was intended to make strings of 128 bytes and shorter quick, but, in fact, strings shorter than 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented under 64-bit, not under 32-bit. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings under 64-bit, but, on 32-bit, strings have to be at least 4KB in order for rep movsb to start outperforming other methods.
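A minimal sketch of testing those CPUID bits from C. ERMSB is reported in CPUID leaf 7, subleaf 0, EBX bit 9, and FSRM in the same leaf, EDX bit 4. The helper names here are made up for illustration, and the sketch assumes GCC or Clang, whose cpuid.h provides __get_cpuid_count; on other targets it simply reports both features as absent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* Decode the feature bits from the registers returned by CPUID leaf 7, subleaf 0. */
int ermsb_from_leaf7_ebx(uint32_t ebx) { return (int)((ebx >> 9) & 1u); } /* EBX bit 9 */
int fsrm_from_leaf7_edx(uint32_t edx)  { return (int)((edx >> 4) & 1u); } /* EDX bit 4 */

/* Query the running CPU. Writes 0 or 1 into *ermsb and *fsrm.
   Returns 1 if leaf 7 could be read, 0 otherwise (flags are then 0). */
int query_rep_string_features(int *ermsb, int *fsrm)
{
    assert(ermsb != NULL && fsrm != NULL);
    *ermsb = 0;
    *fsrm = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;                      /* CPU too old to have leaf 7 */
    *ermsb = ermsb_from_leaf7_ebx(ebx);
    *fsrm = fsrm_from_leaf7_edx(edx);
    return 1;
#else
    return 0;                          /* not an x86 build */
#endif
}
```

A memset or memcpy dispatcher would typically run such a query once at startup and cache the result rather than executing CPUID on every call.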
To get the benefits of rep stosb on processors with the CPUID ERMSB bit, the following conditions should be met:
cld instruction).
According to the Intel Optimization Manual, ERMSB begins to outperform memory store via regular register on Skylake when the length of the memory block is at least 128 bytes. As I wrote, ERMSB has a high internal startup cost, about 35 cycles. ERMSB begins to clearly outperform other methods, including AVX copy and fill, when the length is more than 2048 bytes. However, this mainly applies to the Skylake microarchitecture and is not necessarily the case for other CPU microarchitectures.
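The size threshold described here could be wired into a fill routine roughly as follows. This is only a sketch: the 128-byte cut-over is the Skylake figure quoted above and would need tuning per microarchitecture, the names are hypothetical, and the inline asm assumes GCC or Clang on x86 (other targets always take the plain loop).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-over size; 128 bytes is the Skylake figure quoted above. */
#define REP_STOSB_MIN_BYTES 128u

/* Nonzero when the block is large enough to amortize the startup cost. */
int should_use_rep_stosb(size_t n)
{
    return n >= REP_STOSB_MIN_BYTES;
}

void fill_by_size(unsigned char *dst, unsigned char value, size_t n)
{
    assert(dst != NULL || n == 0);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (should_use_rep_stosb(n)) {
        /* Large block: let the microcoded string store do the work. */
        __asm__ volatile("rep stosb"
                         : "+D"(dst), "+c"(n)
                         : "a"(value)
                         : "memory");
        return;
    }
#endif
    /* Small blocks (and non-x86 builds): plain stores, no startup cost. */
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}
```

Real library implementations usually have more than two size classes (for example a SIMD path in between), so treat the single threshold as a simplification.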
On some processors, but not on others, when the destination buffer is 16-byte aligned, REP STOSB using ERMSB can perform better than SIMD approaches, i.e., when using MMX or SSE registers. When the destination buffer is misaligned, memset() performance using ERMSB can degrade about 20% relative to the aligned case, for processors based on Intel microarchitecture code name Ivy Bridge. In contrast, a SIMD implementation of memset() will experience more negligible degradation when the destination is misaligned, according to Intel's optimization manual.
I've done some benchmarks. The code was filling the same fixed-size buffer many times, so the buffer stayed in cache (L1, L2 or L3, depending on the size of the buffer). The number of iterations was chosen so that the total execution time would be about two seconds.
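The methodology can be sketched in C roughly like this. It is not the original benchmark code: memset stands in for whichever fill method is being measured, clock() is used for timing because it is portable (a real benchmark would want a higher-resolution timer), and 1 MB is taken as 2^20 bytes, which is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Throughput in MB/s, taking 1 MB = 2^20 bytes (an assumption). */
double mb_per_second(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / (1024.0 * 1024.0) / seconds;
}

/* Fill the same buffer `iterations` times; return elapsed CPU time in seconds.
   memset is a stand-in for the method under test. */
double time_repeated_fill(unsigned char *buf, size_t size, unsigned long iterations)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++) {
        memset(buf, (int)(i & 0xFF), size);
#if defined(__GNUC__)
        /* Compiler barrier so repeated fills are not merged or dropped. */
        __asm__ volatile("" : : "r"(buf) : "memory");
#endif
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To turn a run into a figure comparable to the ones below, multiply size by iterations and pass the product, together with the elapsed time, to mb_per_second, after picking an iteration count large enough that the elapsed time is well above the timer resolution.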
On an Intel Core i5 6600 processor, released in September 2015 and based on the Skylake-S quad-core microarchitecture (3.30 GHz base frequency, 3.90 GHz Max Turbo frequency) with 4 x 32K L1 cache, 4 x 256K L2 cache and 6MB L3 cache, I could obtain ~100 GB/sec on REP STOSB with 32K blocks.
REP STOSB:
MOVDQA [RCX],XMM0:
Please note that the drawback of using the XMM0 register is that it is 128 bits (16 bytes) while I could have used the YMM0 register of 256 bits (32 bytes). Anyway, stosb uses the non-RFO protocol. Intel x86 have had "fast strings" since the Pentium Pro (P6) in 1996. The P6 fast strings took REP MOVSB and larger, and implemented them with 64 bit microcode loads and stores and a non-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in Ivy Bridge. See https://stackoverflow.com/a/33905887/6910868 for more details and the source.
Anyway, even if you compare just two of the methods that I have provided, and even though the second method is far from ideal, as you see, on 64-byte blocks rep stosb is slower, but starting from 128-byte blocks, rep stosb begins to outperform other methods, and the difference is very significant starting from 512-byte blocks and longer, provided that you are clearing the same memory block again and again within the cache.
Therefore, for REP STOSB, maximum speed was 103957 (one hundred three thousand nine hundred fifty-seven) Megabytes per second, while with MOVDQA [RCX],XMM0 it was just 26569 (twenty-six thousand five hundred sixty-nine) Megabytes per second.
As you see, the highest performance was on 32K blocks, which is equal to 32K L1 cache of the CPU on which I've made the benchmarks.
I have also done tests on an Intel i7 1065G7 CPU, released in August 2019 (Ice Lake/Sunny Cove microarchitecture), Base frequency: 1.3 GHz, Max Turbo frequency 3.90 GHz. It supports AVX512F instruction set. It has 4 x 32K L1 instruction cache and 4 x 48K data cache, 4x512K L2 cache and 8 MB L3 cache.
On 32K blocks zeroized by rep stosb, performance ranged from 175231 MB/s for a destination misaligned by 1 byte (e.g. $7FF4FDCFFFFF), quickly rose to 219464 MB/s for destinations aligned by 64 bytes (e.g. $7FF4FDCFFFC0), and then gradually rose to 222424 MB/sec for destinations aligned by 256 bytes (e.g. $7FF4FDCFFF00). After that, the speed barely rose, even if the destination was aligned by 32KB (e.g. $7FF4FDD00000), and was still 224850 MB/sec.
There was no difference in speed between rep stosb and rep stosq.
On buffers aligned by 32K, the speed of AVX-512 stores was practically the same as for rep stosb for loops with 2 or more stores per iteration (227777 MB/sec), and didn't grow for loops unrolled to 4 and even 16 stores. However, for a loop of just 1 store the speed was a little bit lower: 203145 MB/sec.
However, if the destination buffer was misaligned by just 1 byte, the speed of AVX-512 stores dropped dramatically, i.e. more than 2 times, to 93811 MB/sec, in contrast to rep stosb on similar buffers, which gave 175231 MB/sec.
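To make the alignment terminology concrete, here is a small C helper (the name is made up for illustration) that computes how far an address lies past the previous multiple of a power-of-two alignment, applied to the example addresses quoted above:

```c
#include <assert.h>
#include <stdint.h>

/* Offset of addr from the previous multiple of `alignment`.
   `alignment` must be a power of two; a result of 0 means "aligned". */
uint64_t misalignment(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & (alignment - 1);
}
```

For instance, misalignment(0x7FF4FDCFFFFF, 64) is 63, i.e. that address sits one byte before a 64-byte boundary, while 0x7FF4FDCFFFC0 gives 0 for an alignment of 64 but a nonzero value for 256.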
On 1KB blocks, AVX-512 stores (205039 MB/s) were about 3 times faster than rep stosb (71817 MB/s).
On 512-byte blocks, AVX-512 stores gave 194181 MB/s, while rep stosb dropped to 38682 MB/s. At this block size, the difference was 5 times in favor of AVX-512.
On 2KB blocks, AVX-512 stores gave 210696 MB/s, while for rep stosb it was 123207 MB/s, almost twice slower. Again, there was no difference between rep stosb and rep stosq.
On 32KB blocks, AVX-512 stores gave 228432 MB/s and rep stosb: 220515 MB/s - now at last! We are approaching the L1 data cache size of my CPU - 48KB! This is 220 Gigabytes per second!
On 64KB blocks, AVX-512 stores gave 61405 MB/s and rep stosb: 70395 MB/s! rep stosb begins to outperform AVX-512 stores.
On 512KB blocks, AVX-512 stores gave 62907 MB/s, while rep stosb made 70653 MB/s. That's where rep stosb begins to outperform AVX-512. The difference is not yet significant, but the bigger the buffer, the bigger the difference.
On 1GB blocks, AVX-512 stores gave 14319 MB/s, while for rep stosb it was 27412 MB/s, i.e. twice as fast as AVX-512!
I've also tried to use non-temporal instructions for filling the 32K buffers (vmovntdq [rcx], zmm31), but the performance was about 4 times slower than just vmovdqa64 [rcx], zmm31. How can I benefit from vmovntdq when filling memory buffers? Should the buffer have some specific size in order for vmovntdq to have an advantage?
Also, if the destination buffers are aligned by at least 64 bytes, there is no performance difference between vmovdqa64 and vmovdqu64. Therefore, I do have a question: is the instruction vmovdqa64 only needed for debugging and safety when we have vmovdqu64?
Figure 1: Speed of iterative store to the same buffer, MB/s
block     AVX   stosb
-----  ------  ------
0.5K   194181   38682
1K     205039   71817
2K     210696  123207
4K     225179  180384
8K     222259  194358
32K    228432  220515
64K     61405   70395
512K    62907   70653
1G      14319   27412
rep stosb on Ice Lake CPUs begins to outperform AVX-512 stores only for repeatedly clearing the same memory buffer larger than the L1 data cache size, i.e. 48K on the Intel i7 1065G7 CPU. And on small memory buffers, AVX-512 stores are much faster: for 1KB - 3 times faster, for 512 bytes - 5 times faster.
However, the AVX-512 stores are susceptible to misaligned buffers, while rep stosb is not as sensitive to misalignment.
Therefore, I have figured out that rep stosb begins to outperform AVX-512 stores only on buffers that exceed the L1 data cache size, or 48KB as in the case of the Intel i7 1065G7 CPU. This conclusion is valid at least on Ice Lake CPUs. An earlier Intel recommendation that string copy begins to outperform AVX copy starting from 2KB buffers should also be re-tested for newer microarchitectures.
My previous benchmarks were filling the same buffer many times in a row. A better benchmark might be to allocate many different buffers and only fill each buffer once, so that the cache does not interfere.
In this scenario, there is not much difference at all between rep stosb and AVX-512 stores. The only caveat is that the total data size must not come close to the physical memory limit, at least under Windows 10 64 bit. In the following benchmarks, the total data size was below 8 GB with total physical RAM of 16 GB. When I was allocating about 12 GB, performance dropped about 20 times, regardless of the method. Windows began to discard purged memory pages, and probably did some other stuff when the memory was about to be full. The L3 cache size of 8MB on the i7 1065G7 CPU did not seem to matter for the benchmarks at all. All that matters is that you don't come close to the physical memory limit, and how such situations are handled depends on your operating system. As I said, under Windows 10, if I took just half of the physical memory, it was OK, but if I took 3/4 of the available memory, my benchmark slowed down 20 times. I didn't even try to take more than 3/4. As I said, the total memory size is 16 GB. The amount available, according to the task manager, was 12 GB.
Here is the benchmark of the speed of filling various blocks of memory totalling 8 GB with zeros (in MB/sec) on the i7 1065G7 CPU with 16 GB total memory, single-threaded. By "AVX" I mean "AVX-512" normal stores, and by "stosb" I mean "rep stosb".
Figure 2: Speed of store to the multiple buffers, once each, MB/s
block AVX stosb
----- ---- ----
0.5K 3641 2759
1K 4709 3963
2K 12133 13163
4K 8239 10295
8K 3534 4675
16K 3396 3242
32K 3738 3581
64K 2953 3006
128K 3150 2857
256K 3773 3914
512K 3204 3680
1024K 3897 4593
2048K 4379 3234
4096K 3568 4970
8192K 4477 5339
If your memory is not in the cache, then the performance of AVX-512 stores and rep stosb is about the same when you need to fill memory with zeros. It is the cache that matters, not the choice between these two methods.
I was zeroizing 6-10 GB of memory split into a sequence of buffers aligned by 64 bytes. No buffers were zeroized twice. Smaller buffers had some overhead, and I had only 16 GB of physical memory, so I zeroized less memory in total with smaller buffers. I ran tests for buffer sizes starting from 256 bytes and up to 8 GB per buffer. I used 3 different methods:
vmovdqa64 [rcx+imm], zmm31 (a loop of 4 stores and then compare the counter);
vmovntdq [rcx+imm], zmm31 (same loop of 4 stores);
rep stosb.
For small buffers, the normal AVX-512 store was the winner. Then, starting from 4KB, the non-temporal store took the lead, while rep stosb still lagged behind.
Then, from 256KB, rep stosb outperformed the normal AVX-512 store, but not the non-temporal store, and after that the situation didn't change. The winner was the non-temporal AVX-512 store, then came rep stosb and then the normal AVX-512 store.
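For readers without AVX-512 hardware, the idea of a non-temporal zeroing loop can be sketched in C with SSE2 intrinsics. This is a swapped-in 16-byte variant (movntdq via _mm_stream_si128), not the vmovntdq zmm loop that was benchmarked; the function name is made up, and builds without SSE2 fall back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Zero n bytes at dst using non-temporal 16-byte stores where possible. */
void zero_nontemporal(void *dst, size_t n)
{
    unsigned char *p = (unsigned char *)dst;
    assert(p != NULL || n == 0);
#if defined(__SSE2__)
    /* Head: plain byte stores until p is 16-byte aligned,
       because _mm_stream_si128 requires an aligned address. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = 0;
        n--;
    }
    const __m128i zero = _mm_setzero_si128();
    while (n >= 16) {
        _mm_stream_si128((__m128i *)p, zero);  /* bypasses the cache */
        p += 16;
        n -= 16;
    }
    /* NT stores are weakly ordered; fence before others rely on the data. */
    _mm_sfence();
#endif
    memset(p, 0, n);  /* tail bytes, or the whole buffer without SSE2 */
}
```

Because the stored lines are not kept in cache, this style only tends to pay off when the buffer will not be read again soon, which fits the pattern in the numbers below where it wins on buffers filled once each.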
Figure 3. Speed of store to the multiple buffers, once each, MB/s by three different methods: normal AVX-512 store, nontemporal AVX-512 store and rep stosb.
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 2.90s, 2.30 GB/s by normal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by nontemporal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by rep stosb
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.06s, 2.62 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.02s, 2.65 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.66s, 2.18 GB/s by rep stosb
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.10s, 2.87 GB/s by normal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.37s, 2.64 GB/s by nontemporal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 4.85s, 1.83 GB/s by rep stosb
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.45s, 2.73 GB/s by normal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.79s, 2.48 GB/s by nontemporal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 4.83s, 1.95 GB/s by rep stosb
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by normal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 3.46s, 2.81 GB/s by nontemporal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by rep stosb
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.24s, 3.04 GB/s by normal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 2.65s, 3.71 GB/s by nontemporal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.35s, 2.94 GB/s by rep stosb
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.37s, 2.94 GB/s by normal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 2.73s, 3.63 GB/s by nontemporal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.53s, 2.81 GB/s by rep stosb
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.19s, 3.12 GB/s by normal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 2.64s, 3.77 GB/s by nontemporal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.44s, 2.90 GB/s by rep stosb
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.08s, 3.24 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 2.58s, 3.86 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.29s, 3.03 GB/s by rep stosb
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.22s, 3.10 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 2.49s, 4.01 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.26s, 3.07 GB/s by rep stosb
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.52s, 3.97 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 1.98s, 5.06 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.43s, 4.11 GB/s by rep stosb
Zeroized 10.00 GB: 20475 blocks of 512 KB for 2.15s, 4.65 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.70s, 5.87 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.81s, 5.53 GB/s by rep stosb
Zeroized 10.00 GB: 10238 blocks of 1 MB for 2.18s, 4.59 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.50s, 6.68 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.63s, 6.13 GB/s by rep stosb
Zeroized 10.00 GB: 5119 blocks of 2 MB for 2.02s, 4.96 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.59s, 6.30 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.54s, 6.50 GB/s by rep stosb
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.90s, 5.26 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.37s, 7.29 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.47s, 6.81 GB/s by rep stosb
Zeroized 9.99 GB: 1279 blocks of 8 MB for 2.04s, 4.90 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.51s, 6.63 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.56s, 6.41 GB/s by rep stosb
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.93s, 5.18 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.37s, 7.30 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.45s, 6.89 GB/s by rep stosb
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.95s, 5.11 GB/s by normal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.41s, 7.06 GB/s by nontemporal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.42s, 7.02 GB/s by rep stosb
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.85s, 5.38 GB/s by normal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.33s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.40s, 7.09 GB/s by rep stosb
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.99s, 4.96 GB/s by normal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.42s, 6.97 GB/s by nontemporal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.55s, 6.37 GB/s by rep stosb
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.83s, 5.32 GB/s by normal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.32s, 7.38 GB/s by nontemporal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.64s, 5.93 GB/s by rep stosb
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.89s, 5.02 GB/s by normal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.31s, 7.27 GB/s by nontemporal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.42s, 6.71 GB/s by rep stosb
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.76s, 5.13 GB/s by normal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.26s, 7.12 GB/s by nontemporal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.29s, 7.00 GB/s by rep stosb
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.48s, 5.42 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.07s, 7.49 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.15s, 6.94 GB/s by rep stosb
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.48s, 5.40 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.08s, 7.40 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.14s, 7.00 GB/s by rep stosb
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.50s, 5.35 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.07s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.21s, 6.63 GB/s by rep stosb
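The figures in each log line can be cross-checked with a little arithmetic. The sketch below assumes that "GB" in the log means 2^30 bytes, which is what the first line implies (27962026 blocks of 256 bytes is about 6.67 x 2^30 bytes); the function names are made up for illustration.

```c
#include <assert.h>

/* Total size in GB, taking 1 GB = 2^30 bytes (inferred from the log, an assumption). */
double total_gb(double blocks, double block_bytes)
{
    return blocks * block_bytes / (1024.0 * 1024.0 * 1024.0);
}

/* Throughput in GB/s for a run that took `seconds`. */
double gb_per_second(double blocks, double block_bytes, double seconds)
{
    assert(seconds > 0.0);
    return total_gb(blocks, block_bytes) / seconds;
}
```

For the first line of the log, total_gb(27962026, 256) is roughly 6.67 and gb_per_second(27962026, 256, 2.90) is roughly 2.30, matching the reported values after rounding.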
For all the AVX-512 code, I've used the ZMM31 register, because SSE registers are numbered from 0 to 15, so the AVX-512 registers 16 to 31 do not have SSE counterparts and thus do not incur the transition penalty.