I've written the following code for benchmarking the effect of cache misses on performance:
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>
// Avoiding using power of 2 because of possible performance degradation due to cache associativity?
static const size_t ROW_SIZE = 600;
static const size_t COL_SIZE = 600;
static const size_t TEST_COUNT = 50;
#define SAME_TYPES 1
#define INIT_TO_ONE 0
#if SAME_TYPES
#define ARY_TYPE uint64_t
#define RET_TYPE uint64_t
#else
#define ARY_TYPE uint32_t
#define RET_TYPE uint64_t
#endif
RET_TYPE sum_array_rows(const ARY_TYPE *a, const size_t step)
{
    RET_TYPE sum = 0;
    for (size_t row = 0; row < ROW_SIZE; row++)
        for (size_t col = 0; col < COL_SIZE; col++)
            sum += static_cast<RET_TYPE>(a[row * step + col]);
    return sum;
}
RET_TYPE sum_array_cols(const ARY_TYPE *a, const size_t step)
{
    RET_TYPE sum = 0;
    for (size_t col = 0; col < COL_SIZE; col++)
        for (size_t row = 0; row < ROW_SIZE; row++)
            sum += static_cast<RET_TYPE>(a[row * step + col]);
    return sum;
}
int main()
{
#if INIT_TO_ONE
    ARY_TYPE *a = new ARY_TYPE[ROW_SIZE * COL_SIZE];
    for (size_t i = 0; i < ROW_SIZE * COL_SIZE; i++) a[i] = 1;
#else
    ARY_TYPE *a = new ARY_TYPE[ROW_SIZE * COL_SIZE]();
#endif
    std::chrono::high_resolution_clock hrc;
    std::cout << "SAME_TYPES: " << SAME_TYPES << "\n";
    std::cout << "INIT_TO_ONE: " << INIT_TO_ONE << "\n";
    std::cout << "ROW_SIZE: " << ROW_SIZE << "\n";
    std::cout << "COL_SIZE: " << COL_SIZE << "\n\n";
    {
        RET_TYPE sum = 0;
        auto start = hrc.now();
        for (size_t i = 0; i < TEST_COUNT; i++) sum = sum_array_rows(a, COL_SIZE);
        auto end = hrc.now();
        std::cout << "Time taken: " << (end - start).count() / TEST_COUNT << "\n";
        std::cout << "Sum: " << sum << "\n";
    }
    // I've added this because I want to trash the cache
    // I'm not sure if this is necessary or if it's doing any good...
    ARY_TYPE *other = new ARY_TYPE[ROW_SIZE * COL_SIZE];
    for (size_t i = 0; i < ROW_SIZE * COL_SIZE; i++) other[i] = 1;
    {
        RET_TYPE sum = other[ROW_SIZE] - 1;
        auto start = hrc.now();
        for (size_t i = 0; i < TEST_COUNT; i++) sum = sum_array_cols(a, COL_SIZE);
        auto end = hrc.now();
        std::cout << "Time taken: " << (end - start).count() / TEST_COUNT << "\n";
        std::cout << "Sum: " << sum << "\n";
    }
    delete[] other;
    delete[] a;
    return 0;
}
I have two functions, sum_array_rows and sum_array_cols, which take in an array, add up its elements, and return the sum. The difference is the order in which the elements are accessed. The return type of both functions is always uint64_t.
I have a define near the top of the file called SAME_TYPES. If SAME_TYPES is set then both the return type and the element type are uint64_t. If not, the element type is uint32_t.
So running this code gives me...
With SAME_TYPES set to 1:
SAME_TYPES: 1
INIT_TO_ONE: 0
ROW_SIZE: 600
COL_SIZE: 600
Time taken: 80948 (element type is uint64_t, ROW first)
Sum: 0
Time taken: 566241 (element type is uint64_t, COL first)
Sum: 0
With SAME_TYPES set to 0:
SAME_TYPES: 0
INIT_TO_ONE: 0
ROW_SIZE: 600
COL_SIZE: 600
Time taken: 283348 (element type is uint32_t, ROW first)
Sum: 0
Time taken: 369653 (element type is uint32_t, COL first)
Sum: 0
I'm confused why this is having any effect on performance. When the data type is the same, the result seems expected: row-col is faster than col-row. But why, when the types are different, do I see a decrease in row-col performance and an increase in col-row performance? I understand that if the element type is larger then I can fit fewer elements into my cache, but then why am I seeing an increase in performance when iterating over columns in the outer loop? Can someone please explain this to me?
In case it matters, I'm using VS 2015 for compiling this and my CPU is i5-4460 (running Windows 10).
Edit: I guess I shouldn't have blindly trusted the compiler. I was originally compiling with /O2, which is the default setting. When I compile with no optimisation I get the expected behaviour:
With SAME_TYPES set to 1:
SAME_TYPES: 1
INIT_TO_ONE: 0
ROW_SIZE: 600
COL_SIZE: 600
Time taken: 1009518 (element type is uint64_t, ROW first)
Sum: 0
Time taken: 1504863 (element type is uint64_t, COL first)
Sum: 0
With SAME_TYPES set to 0:
SAME_TYPES: 0
INIT_TO_ONE: 0
ROW_SIZE: 600
COL_SIZE: 600
Time taken: 909479 (element type is uint32_t, ROW first)
Sum: 0
Time taken: 1244492 (element type is uint32_t, COL first)
Sum: 0
The data type still has an effect but it now seems reasonable. Here is the assembly when compiling with optimisation on:
Optimisation ... /O2
SAME_TYPE ...... 1
RET_TYPE sum_array_rows(const ARY_TYPE *a, const size_t step)
{
00FD10C0 xorps xmm2,xmm2
RET_TYPE sum = 0;
00FD10C3 lea eax,[ecx+10h]
00FD10C6 movaps xmm1,xmm2
00FD10C9 mov edx,258h
00FD10CE xchg ax,ax
for (size_t col = 0; col < COL_SIZE; col++)
00FD10D0 mov ecx,96h
sum += static_cast<RET_TYPE>(a[row * step + col]);
00FD10D5 movups xmm0,xmmword ptr [eax-10h]
00FD10D9 paddq xmm2,xmm0
00FD10DD movups xmm0,xmmword ptr [eax]
00FD10E0 add eax,20h
00FD10E3 paddq xmm1,xmm0
00FD10E7 sub ecx,1
00FD10EA jne sum_array_rows+15h (0FD10D5h)
for (size_t row = 0; row < ROW_SIZE; row++)
00FD10EC sub edx,1
for (size_t row = 0; row < ROW_SIZE; row++)
00FD10EF jne sum_array_rows+10h (0FD10D0h)
return sum;
}
00FD10F1 paddq xmm1,xmm2
00FD10F5 movaps xmm0,xmm1
00FD10F8 psrldq xmm0,8
00FD10FD paddq xmm1,xmm0
00FD1101 movd eax,xmm1
00FD1105 psrldq xmm1,4
00FD110A movd edx,xmm1
00FD110E ret
RET_TYPE sum_array_cols(const ARY_TYPE *a, const size_t step)
{
00FD1110 push ebp
00FD1111 mov ebp,esp
00FD1113 sub esp,24h
00FD1116 push ebx
00FD1117 xorps xmm0,xmm0
RET_TYPE sum = 0;
00FD111A mov dword ptr [ebp-10h],258h
00FD1121 push esi
00FD1122 movlpd qword ptr [ebp-18h],xmm0
00FD1127 lea eax,[ecx+2580h]
00FD112D mov edx,dword ptr [ebp-14h]
00FD1130 push edi
00FD1131 mov edi,dword ptr [sum]
00FD1134 mov dword ptr [ebp-0Ch],eax
00FD1137 nop word ptr [eax+eax]
for (size_t row = 0; row < ROW_SIZE; row++)
00FD1140 xorps xmm0,xmm0
00FD1143 mov dword ptr [ebp-8],0C8h
00FD114A movlpd qword ptr [sum],xmm0
00FD114F mov ecx,dword ptr [ebp-18h]
00FD1152 mov ebx,dword ptr [ebp-14h]
00FD1155 movlpd qword ptr [ebp-20h],xmm0
00FD115A mov esi,dword ptr [ebp-20h]
00FD115D mov dword ptr [ebp-4],ecx
00FD1160 mov ecx,dword ptr [ebp-1Ch]
00FD1163 nop dword ptr [eax]
00FD1167 nop word ptr [eax+eax]
sum += static_cast<RET_TYPE>(a[row * step + col]);
00FD1170 add edi,dword ptr [eax-2580h]
00FD1176 mov dword ptr [sum],edi
00FD1179 adc edx,dword ptr [eax-257Ch]
00FD117F mov edi,dword ptr [ebp-4]
00FD1182 add edi,dword ptr [eax-12C0h]
00FD1188 mov dword ptr [ebp-4],edi
00FD118B adc ebx,dword ptr [eax-12BCh]
00FD1191 add esi,dword ptr [eax]
00FD1193 mov edi,dword ptr [sum]
00FD1196 adc ecx,dword ptr [eax+4]
00FD1199 lea eax,[eax+3840h]
00FD119F sub dword ptr [ebp-8],1
00FD11A3 jne sum_array_cols+60h (0FD1170h)
for (size_t col = 0; col < COL_SIZE; col++)
00FD11A5 add esi,dword ptr [ebp-4]
00FD11A8 mov eax,dword ptr [ebp-0Ch]
00FD11AB adc ecx,ebx
00FD11AD add edi,esi
00FD11AF adc edx,ecx
00FD11B1 add eax,8
00FD11B4 sub dword ptr [ebp-10h],1
00FD11B8 mov dword ptr [ebp-0Ch],eax
00FD11BB jne sum_array_cols+30h (0FD1140h)
return sum;
00FD11BD mov eax,edi
}
00FD11BF pop edi
00FD11C0 pop esi
00FD11C1 pop ebx
00FD11C2 mov esp,ebp
00FD11C4 pop ebp
00FD11C5 ret
================
Optimisation ... /O2
SAME_TYPE ...... 0
RET_TYPE sum_array_rows(const ARY_TYPE *a, const size_t step)
{
00A110C0 push ebp
00A110C1 mov ebp,esp
00A110C3 sub esp,24h
00A110C6 push ebx
00A110C7 xorps xmm0,xmm0
RET_TYPE sum = 0;
00A110CA mov dword ptr [ebp-0Ch],258h
00A110D1 movlpd qword ptr [ebp-18h],xmm0
00A110D6 mov edx,dword ptr [ebp-14h]
00A110D9 mov eax,dword ptr [sum]
00A110DC push esi
00A110DD push edi
00A110DE xchg ax,ax
for (size_t col = 0; col < COL_SIZE; col++)
00A110E0 xorps xmm0,xmm0
00A110E3 mov dword ptr [ebp-8],0C8h
00A110EA movlpd qword ptr [sum],xmm0
00A110EF mov esi,dword ptr [ebp-18h]
00A110F2 mov ebx,dword ptr [ebp-14h]
for (size_t col = 0; col < COL_SIZE; col++)
00A110F5 movlpd qword ptr [ebp-20h],xmm0
00A110FA mov edi,dword ptr [ebp-20h]
00A110FD mov dword ptr [ebp-4],esi
00A11100 mov esi,dword ptr [ebp-1Ch]
sum += static_cast<RET_TYPE>(a[row * step + col]);
00A11103 add eax,dword ptr [ecx]
00A11105 adc edx,0
00A11108 mov dword ptr [sum],edx
00A1110B mov edx,dword ptr [ebp-4]
00A1110E add edx,dword ptr [ecx+4]
00A11111 mov dword ptr [ebp-4],edx
00A11114 mov edx,dword ptr [sum]
00A11117 adc ebx,0
00A1111A add edi,dword ptr [ecx+8]
00A1111D adc esi,0
00A11120 add ecx,0Ch
00A11123 sub dword ptr [ebp-8],1
00A11127 jne sum_array_rows+43h (0A11103h)
for (size_t row = 0; row < ROW_SIZE; row++)
00A11129 add edi,dword ptr [ebp-4]
00A1112C adc esi,ebx
00A1112E add eax,edi
00A11130 adc edx,esi
00A11132 sub dword ptr [ebp-0Ch],1
00A11136 jne sum_array_rows+20h (0A110E0h)
return sum;
}
00A11138 pop edi
00A11139 pop esi
00A1113A pop ebx
00A1113B mov esp,ebp
00A1113D pop ebp
00A1113E ret
RET_TYPE sum_array_cols(const ARY_TYPE *a, const size_t step)
{
00A11140 push ebp
00A11141 mov ebp,esp
00A11143 sub esp,24h
00A11146 push ebx
00A11147 xorps xmm0,xmm0
RET_TYPE sum = 0;
00A1114A mov dword ptr [ebp-10h],258h
00A11151 push esi
00A11152 movlpd qword ptr [ebp-18h],xmm0
00A11157 lea eax,[ecx+12C0h]
00A1115D mov edx,dword ptr [ebp-14h]
00A11160 push edi
00A11161 mov edi,dword ptr [sum]
00A11164 mov dword ptr [ebp-0Ch],eax
00A11167 nop word ptr [eax+eax]
for (size_t row = 0; row < ROW_SIZE; row++)
00A11170 xorps xmm0,xmm0
00A11173 mov dword ptr [ebp-8],0C8h
00A1117A movlpd qword ptr [sum],xmm0
00A1117F mov ecx,dword ptr [ebp-18h]
00A11182 mov ebx,dword ptr [ebp-14h]
00A11185 movlpd qword ptr [ebp-20h],xmm0
00A1118A mov esi,dword ptr [ebp-20h]
00A1118D mov dword ptr [ebp-4],ecx
00A11190 mov ecx,dword ptr [ebp-1Ch]
00A11193 nop dword ptr [eax]
00A11197 nop word ptr [eax+eax]
sum += static_cast<RET_TYPE>(a[row * step + col]);
00A111A0 add edi,dword ptr [eax-12C0h]
00A111A6 lea eax,[eax+1C20h]
00A111AC adc edx,0
00A111AF mov dword ptr [sum],edx
00A111B2 mov edx,dword ptr [ebp-4]
00A111B5 add edx,dword ptr [eax-2580h]
00A111BB mov dword ptr [ebp-4],edx
00A111BE mov edx,dword ptr [sum]
00A111C1 adc ebx,0
00A111C4 add esi,dword ptr [eax-1C20h]
00A111CA adc ecx,0
00A111CD sub dword ptr [ebp-8],1
00A111D1 jne sum_array_cols+60h (0A111A0h)
for (size_t col = 0; col < COL_SIZE; col++)
00A111D3 add esi,dword ptr [ebp-4]
00A111D6 mov eax,dword ptr [ebp-0Ch]
00A111D9 adc ecx,ebx
00A111DB add edi,esi
00A111DD adc edx,ecx
00A111DF add eax,4
00A111E2 sub dword ptr [ebp-10h],1
00A111E6 mov dword ptr [ebp-0Ch],eax
00A111E9 jne sum_array_cols+30h (0A11170h)
return sum;
00A111EB mov eax,edi
}
00A111ED pop edi
00A111EE pop esi
00A111EF pop ebx
00A111F0 mov esp,ebp
00A111F2 pop ebp
00A111F3 ret
The short answer is that the optimizer is trying to be clever but its heuristics are failing in the case with disparate type sizes, which ends up slowing down the code.
Let's begin by reviewing the code generated for the inner loop of sum_array_rows with 64-bit source data.
innerloop:
movups xmm0,[eax-0x10]
paddq xmm2,xmm0
movups xmm0,[eax]
add eax,0x20
paddq xmm1,xmm0
sub ecx,1
jne innerloop
This is roughly equivalent to the following C code using intrinsics:
do {
    sum1 = _mm_add_epi64(sum1, _mm_loadu_si128(&ptr[0]));
    sum2 = _mm_add_epi64(sum2, _mm_loadu_si128(&ptr[1]));
    ptr += 2;
} while(--count);
What we see here is that the optimizer has determined that addition is associative and consequently unrolled the loop onto four parallel accumulators (two SSE registers holding two 64-bit lanes each), which eventually get summed together at the end of the loop. This permits the independent computations to be executed in parallel by the CPU and, more importantly, permits vectorization: the SSE2 instruction set adds pairs of 64-bit integers in a single instruction.
This is a good result.
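To make the idea concrete, the same transformation can be sketched in plain C++ (ptr and count are illustrative names for the data pointer and element count, not taken from the compiler output, and count is assumed to be a multiple of four):
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (size_t i = 0; i < count; i += 4) {
    s0 += ptr[i + 0]; // four independent dependency chains,
    s1 += ptr[i + 1]; // so the CPU can overlap the additions
    s2 += ptr[i + 2];
    s3 += ptr[i + 3];
}
uint64_t sum = (s0 + s1) + (s2 + s3); // fold the accumulators once at the end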
At the other end of the spectrum we have the version with 64-bit accumulation of 32-bit source data:
innerloop:
add eax,[ecx]
adc edx,0
mov [sum],edx
mov edx,[ebp-4]
add edx,[ecx+4]
mov [ebp-4],edx
mov edx,[sum]
adc ebx,0
add edi,[ecx+8]
adc esi,0
add ecx,12
sub [ebp-8],1
jne innerloop
Note that the 32-bit x86 target lacks native support for 64-bit arithmetic in the normal (non-vector) instruction set. Consequently each accumulator is split into separate upper and lower words, with carries from the lower word manually propagated into the upper word.
Furthermore the loop is unrolled three times instead of four.
The pseudo-C version reads roughly like this:
do {
    sum1_lo += ptr[0]; if(carry) ++sum1_hi;
    sum2_lo += ptr[1]; if(carry) ++sum2_hi;
    sum3_lo += ptr[2]; if(carry) ++sum3_hi;
    ptr += 3;
} while(--count);
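For reference, the same carry propagation can be written in portable C++ (again just a sketch with illustrative names, not the compiler's actual output): an unsigned addition wrapped around precisely when the result is smaller than the addend.
uint32_t lo = 0, hi = 0;
for (size_t i = 0; i < count; i++) {
    lo += ptr[i];          // 32-bit add into the low word
    if (lo < ptr[i]) ++hi; // wraparound means a carry into the high word
}
uint64_t sum = (uint64_t(hi) << 32) | lo;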
Regrettably the unrolling is a pessimization here, as the processor has run out of registers and is forced to shunt sum1_lo/sum2_lo and count into memory. An unrolling factor of two would have been the correct choice, and even no unrolling at all would have been faster.
A vectorized version using parallel addition on native 64-bit integers would still have been possible. However, this requires the 32-bit source data to be unpacked (zero-extended) to 64 bits first. Something along these lines:
__m128i zero = _mm_setzero_si128();
do {
    __m128i data = _mm_loadu_si128(ptr++);
    sum1 = _mm_add_epi64(sum1, _mm_unpacklo_epi32(data, zero));
    sum2 = _mm_add_epi64(sum2, _mm_unpackhi_epi32(data, zero));
} while(--count);
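The loop above leaves the total spread across the four 64-bit lanes of sum1 and sum2; a possible reduction tail (again only a sketch) is:
__m128i total = _mm_add_epi64(sum1, sum2); // fold the two vector accumulators
uint64_t lanes[2];
_mm_storeu_si128(reinterpret_cast<__m128i *>(lanes), total);
uint64_t sum = lanes[0] + lanes[1];        // add the two remaining lanes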
I have omitted the code, but the folding of intermediate results is also needlessly performed within the outer loop rather than being deferred to the end of the function. Better yet would be a single combined loop over the raw COL_SIZE*ROW_SIZE array, as sketched below.
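Since the benchmark always passes step == COL_SIZE, the rows are contiguous in memory and the whole sum reduces to one flat pass. A sketch using the question's own macros (sum_array_flat is a hypothetical name):
RET_TYPE sum_array_flat(const ARY_TYPE *a)
{
    RET_TYPE sum = 0;
    for (size_t i = 0; i < ROW_SIZE * COL_SIZE; i++)
        sum += static_cast<RET_TYPE>(a[i]);
    return sum;
}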
So what is going wrong here? Well, modern optimizers are complex beasts full of heuristics, and without insight into their internals we can mostly only speculate.
However, a simplified model is that they are structured into passes, starting from a high-level representation and gradually applying transformations into a lower-level form which is hoped to turn out more efficient. Regrettably, when a relatively high-level transform such as unrolling takes place, low-level register allocation may not yet have happened, so the compiler is largely forced to guess at a suitable unrolling factor.
It has also been my experience that emulated wide-integer arithmetic rarely receives the most love and is frequently shunted off to generic fallbacks and poorly integrated with the other transforms.
Furthermore, vectorization in particular is a difficult process and is generally applied only when the code falls into one of the patterns known to the compiler.
In practice the consequence amounts to a last-straw effect: a minor alteration of previously efficient code may fall just beyond the complexity horizon of the optimizer. Instead of going back and forth indefinitely between the passes until an optimal result is reached, the compiler is forced to rely on guesswork for performance reasons and will eventually cop out, at which point the house of cards comes tumbling down.
Therefore, if your project relies on code generation, it is up to you to perform due diligence by periodically reviewing the output and testing for regressions; and for critical inner loops it is generally advisable to explicitly write something closer to the expected final result rather than relying on fallible transformations.