Today I found sample code that slowed down by 50% after some unrelated code was added. After debugging, I figured out that the problem was loop alignment: depending on where the loop code is placed, the execution time differs, e.g.:
Address | Time [us]
---|---
00007FF780A01270 | 980
00007FF7750B1280 | 1500
00007FF7750B1290 | 986
00007FF7750B12A0 | 1500
I didn't expect code alignment to have such a big impact, and I thought my compiler was smart enough to align the code correctly.
What exactly causes such a big difference in execution time? (I suppose it is some processor architecture detail.)
I compiled the test program in Release mode with Visual Studio 2019 and ran it on Windows 10. I checked the program on two processors: an i7-8700K (the results above) and an Intel i5-3570K, where the problem does not exist and the execution time is always about 1250 us. I also tried compiling the program with clang, but with clang the result is always ~1500 us (on the i7-8700K).
My test program:
#include <chrono>
#include <cstring>
#include <intrin.h>
#include <iostream>
using namespace std;

template<int N>
__forceinline void noops()
{
    __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop();
    noops<N - 1>();
}

template<>
__forceinline void noops<0>() {}

template<int OFFSET>
__declspec(noinline) void SumHorizontalLine(const unsigned char* __restrict src, int width, int a, unsigned short* __restrict dst)
{
    unsigned short sum = 0;
    const unsigned char* srcP1 = src - a - 1;
    const unsigned char* srcP2 = src + a;

    // some dummy loop, just a few iterations
    for (int i = 0; i < a; ++i)
        dst[i] = src[i] / (double)dst[i];

    noops<OFFSET>();

    // the important loop
    for (int x = a + 1; x < width - a; x++)
    {
        unsigned char v1 = srcP1[x];
        unsigned char v2 = srcP2[x];
        sum -= v1;
        sum += v2;
        dst[x] = sum;
    }
}

template<int OFFSET>
void RunTest(unsigned char* __restrict src, int width, int a, unsigned short* __restrict dst)
{
    double minTime = 99999999;
    for (int i = 0; i < 20; ++i)
    {
        auto start = chrono::steady_clock::now();
        for (int j = 0; j < 1024; ++j)
        {
            SumHorizontalLine<OFFSET>(src, width, a, dst);
        }
        auto end = chrono::steady_clock::now();
        auto us = chrono::duration_cast<chrono::microseconds>(end - start).count();
        if (us < minTime)
        {
            minTime = us;
        }
    }
    cout << OFFSET << " : " << minTime << " us" << endl;
}

int main()
{
    const int width = 2048;
    const int x = 3;
    unsigned char* src = new unsigned char[width * 5];
    unsigned short* dst = new unsigned short[width];
    memset(src, 0, sizeof(unsigned char) * width);
    memset(dst, 0, sizeof(unsigned short) * width);
    while (true)
        RunTest<1>(src, width, x, dst);
}
To verify a different alignment, just recompile the program and change RunTest<1> to RunTest<0>, RunTest<2>, etc. The compiler always aligns the code to 16 bytes; in my test code I just insert additional nops to move the code a bit further.
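For example, here is a minimal sketch of a variant of main() (assuming the rest of the program above is unchanged; the choice of offsets is arbitrary) that instantiates several offsets in one binary, so a single run steps the hot loop through successive 16-byte placements:

int main()
{
    const int width = 2048;
    const int x = 3;
    unsigned char* src = new unsigned char[width * 5];
    unsigned short* dst = new unsigned short[width];
    memset(src, 0, sizeof(unsigned char) * width);
    memset(dst, 0, sizeof(unsigned short) * width);

    // each extra noops<> level adds 16 nops, shifting the hot loop by one
    // 16-byte step relative to the previous instantiation
    RunTest<0>(src, width, x, dst);
    RunTest<1>(src, width, x, dst);
    RunTest<2>(src, width, x, dst);
    RunTest<3>(src, width, x, dst);
}

Keep in mind that each instantiation also gets its own base address in a single binary, so the absolute placement of each loop should still be confirmed in a disassembler.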
Assembly code generated for the loop with OFFSET=1 (for other offsets, only the number of npads differs):
0007c 90 npad 1
0007d 90 npad 1
0007e 49 83 c1 08 add r9, 8
00082 90 npad 1
00083 90 npad 1
00084 90 npad 1
00085 90 npad 1
00086 90 npad 1
00087 90 npad 1
00088 90 npad 1
00089 90 npad 1
0008a 90 npad 1
0008b 90 npad 1
0008c 90 npad 1
0008d 90 npad 1
0008e 90 npad 1
0008f 90 npad 1
$LL15@SumHorizon:
; 25 :
; 26 : noops<OFFSET>();
; 27 :
; 28 : for (int x = a + 1; x < width - a; x++)
; 29 : {
; 30 : unsigned char v1 = srcP1[x];
; 31 : unsigned char v2 = srcP2[x];
; 32 : sum -= v1;
00090 0f b6 42 f9 movzx eax, BYTE PTR [rdx-7]
00094 4d 8d 49 02 lea r9, QWORD PTR [r9+2]
; 33 : sum += v2;
00098 0f b6 0a movzx ecx, BYTE PTR [rdx]
0009b 48 8d 52 01 lea rdx, QWORD PTR [rdx+1]
0009f 66 2b c8 sub cx, ax
000a2 66 44 03 c1 add r8w, cx
; 34 : dst[x] = sum;
000a6 66 45 89 41 fe mov WORD PTR [r9-2], r8w
000ab 49 83 ea 01 sub r10, 1
000af 75 df jne SHORT $LL15@SumHorizon
; 35 : }
; 36 :
; 37 : }
000b1 c3 ret 0
??$SumHorizontalLine@$00@@YAXPEIBEHHPEIAG@Z ENDP ; SumHorizont
Aligned data is performance-critical, but the good news is that you mostly don't have to care about it yourself: almost any compiler for almost any language produces machine code that respects the target system's alignment requirements.
Normally, the hardware will fetch the proper 4-byte quantity from memory in a single access, but if the address is not aligned, it has to fetch two 4-byte locations and reconstruct the desired 4-byte quantity from the appropriate bytes of the two. On top of that, some SSE instructions require their operands to be 16-byte aligned.
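As a concrete illustration of the SSE point (a minimal standalone sketch, unrelated to the hot loop above): SSE offers both an alignment-requiring load and an alignment-agnostic one, and the aligned variant may fault when handed a misaligned pointer.

#include <xmmintrin.h>

int main()
{
    alignas(16) float data[8] = {};
    __m128 a = _mm_load_ps(data);      // requires a 16-byte-aligned address
    __m128 u = _mm_loadu_ps(data + 1); // unaligned load, always permitted
    _mm_storeu_ps(data, _mm_add_ps(a, u)); // unaligned store back
}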
This is the kind of performance test you can very easily do yourself: add or remove nops around the code under test, do an accurate job of timing, and move the instructions under test across a wide enough range of addresses to touch the edges of cache lines. The same kind of experiment works for data accesses.
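For instance, a small helper (hypothetical, assuming it is compiled together with the test program above so that SumHorizontalLine and the using-directive are in scope) can print where each instantiation actually landed, so timings can be correlated with addresses as in the table at the top:

#include <cstdint>

// prints the placement of the measured function; the hot loop can then
// be located near this address in a disassembler
template<int OFFSET>
void ReportPlacement()
{
    cout << hex << reinterpret_cast<uintptr_t>(&SumHorizontalLine<OFFSET>) << dec << endl;
}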
In the slow cases (i.e., 00007FF7750B1280 and 00007FF7750B12A0), the jne instruction crosses a 32-byte boundary. The mitigations for the "Jump Conditional Code" (JCC) erratum (https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf) prevent such instructions from being cached in the DSB (the decoded-uop cache), so the loop has to be delivered by the slower legacy decode pipeline instead. The JCC erratum only applies to Skylake-based CPUs, which is why the effect does not occur on your i5-3570K.
As Peter Cordes pointed out in a comment, recent compilers have options that try to mitigate this effect. Intel JCC Erratum - should JCC really be treated separately? mentions MSVC's /QIntel-jcc-erratum option; another related question is How can I mitigate the impact of the Intel jcc erratum on gcc?
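As a rough check (a sketch of my own, derived from the Intel document's description rather than taken from it): an instruction (or a macro-fused cmp/test+jcc pair) is affected when its bytes cross, or end exactly on, a 32-byte boundary. Plugging in the addresses from the table above (assuming those addresses are the loop label $LL15, so the 2-byte jne sits 0x1f bytes past them, per the listing):

#include <cstdint>
#include <cstdio>

// affected if the instruction's bytes cross, or end exactly on,
// a 32-byte boundary; 'start' is its address, 'len' its length in bytes
static bool jccErratumAffected(uint64_t start, uint64_t len)
{
    return (start & ~31ull) != ((start + len) & ~31ull);
}

int main()
{
    printf("%d\n", (int)jccErratumAffected(0x00007FF7750B1280 + 0x1f, 2)); // slow case -> 1
    printf("%d\n", (int)jccErratumAffected(0x00007FF7750B1290 + 0x1f, 2)); // fast case -> 0
}

With MSVC, compiling with /QIntel-jcc-erratum (e.g. cl /O2 /QIntel-jcc-erratum test.cpp) makes the compiler pad code so that jumps avoid 32-byte boundaries, at a small code-size cost.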