I've written the following code for benchmarking the effect of cache misses on performance:
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>
// Avoiding using power of 2 because of possible performance degradation due to cache associativity?
static const size_t ROW_SIZE = 600;
static const size_t COL_SIZE = 600;
static const size_t TEST_COUNT = 50;
#define SAME_TYPES 1
#define INIT_TO_ONE 0
#if SAME_TYPES
#define ARY_TYPE uint64_t
#define RET_TYPE uint64_t
#else
#define ARY_TYPE uint32_t
#define RET_TYPE uint64_t
#endif
RET_TYPE sum_array_rows(const ARY_TYPE *a, const size_t step)
{
    RET_TYPE sum = 0;
    for (size_t row = 0; row < ROW_SIZE; row++)
        for (size_t col = 0; col < COL_SIZE; col++)
            sum += static_cast<RET_TYPE>(a[row * step + col]);
    return sum;
}
RET_TYPE sum_array_cols(const ARY_TYPE *a, const size_t step)
{
    RET_TYPE sum = 0;
    for (size_t col = 0; col < COL_SIZE; col++)
        for (size_t row = 0; row < ROW_SIZE; row++)
            sum += static_cast<RET_TYPE>(a[row * step + col]);
    return sum;
}
int main()
{
#if INIT_TO_ONE
    ARY_TYPE *a = new ARY_TYPE[ROW_SIZE * COL_SIZE];
    for (size_t i = 0; i < ROW_SIZE * COL_SIZE; i++) a[i] = 1;
#else
    ARY_TYPE *a = new ARY_TYPE[ROW_SIZE * COL_SIZE]();
#endif
    std::chrono::high_resolution_clock hrc;
    std::cout << "SAME_TYPES: " << SAME_TYPES << "\n";
    std::cout << "INIT_TO_ONE: " << INIT_TO_ONE << "\n";
    std::cout << "ROW_SIZE: " << ROW_SIZE << "\n";
    std::cout << "COL_SIZE: " << COL_SIZE << "\n\n";
    {
        RET_TYPE sum = 0;
        auto start = hrc.now();
        for (size_t i = 0; i < TEST_COUNT; i++) sum = sum_array_rows(a, COL_SIZE);
        auto end = hrc.now();
        std::cout << "Time taken: " << (end - start).count() / TEST_COUNT << "\n";
        std::cout << "Sum: " << sum << "\n";
    }
    // I've added this because I want to trash the cache
    // I'm not sure if this is necessary or if it's doing any good...
    ARY_TYPE *other = new ARY_TYPE[ROW_SIZE * COL_SIZE];
    for (size_t i = 0; i < ROW_SIZE * COL_SIZE; i++) other[i] = 1;
    {
        RET_TYPE sum = other[ROW_SIZE] - 1;
        auto start = hrc.now();
        for (size_t i = 0; i < TEST_COUNT; i++) sum = sum_array_cols(a, COL_SIZE);
        auto end = hrc.now();
        std::cout << "Time taken: " << (end - start).count() / TEST_COUNT << "\n";
        std::cout << "Sum: " << sum << "\n";
    }
    delete[] other;
    delete[] a;
    return 0;
}
I have two functions, sum_array_rows and sum_array_cols, which take in an array, add up its elements, and return the sum. The difference is the order in which the elements are accessed. The return type of both functions is always uint64_t.
I have a define near the top of the file called SAME_TYPES. If SAME_TYPES is set then both the return type and the element type are uint64_t. If not, the element type is uint32_t.
So running this code gives me...
With SAME_TYPES set to 1:
SAME_TYPES: 1
INIT_TO_ONE: 0
ROW_SIZE: 600
COL_SIZE: 600
Time taken: 80948 (element type is uint64_t, ROW first)
Sum: 0
Time taken: 566241 (element type is uint64_t, COL first)
Sum: 0
With SAME_TYPES set to 0:
SAME_TYPES: 0
INIT_TO_ONE: 0
ROW_SIZE: 600
COL_SIZE: 600
Time taken: 283348 (element type is uint32_t, ROW first)
Sum: 0
Time taken: 369653 (element type is uint32_t, COL first)
Sum: 0
I'm confused why this is having any effect on performance. When the data type is the same, the result seems expected: row-col is faster than col-row. But why, when the types are different, do I see a decrease in row-col performance and an increase in col-row performance? I understand that if the element type is larger then I can fit fewer elements into my cache, but then why am I seeing an increase in performance when iterating over columns in the outer loop? Can someone please explain this to me?
In case it matters, I'm using VS 2015 for compiling this and my CPU is i5-4460 (running Windows 10).
Edit: I guess I shouldn't have blindly trusted the compiler. I was originally compiling with /O2, which is the default setting. When I compile with no optimisation I get the expected behaviour:
With SAME_TYPES set to 1:
SAME_TYPES: 1
INIT_TO_ONE: 0
ROW_SIZE: 600
COL_SIZE: 600
Time taken: 1009518 (element type is uint64_t, ROW first)
Sum: 0
Time taken: 1504863 (element type is uint64_t, COL first)
Sum: 0
With SAME_TYPES set to 0:
SAME_TYPES: 0
INIT_TO_ONE: 0
ROW_SIZE: 600
COL_SIZE: 600
Time taken: 909479 (element type is uint32_t, ROW first)
Sum: 0
Time taken: 1244492 (element type is uint32_t, COL first)
Sum: 0
The data type still has an effect but it now seems reasonable. Here is the assembly when compiling with optimisation on:
Optimisation ... /O2
SAME_TYPE ...... 1
RET_TYPE sum_array_rows(const ARY_TYPE *a, const size_t step)
{
00FD10C0 xorps xmm2,xmm2
RET_TYPE sum = 0;
00FD10C3 lea eax,[ecx+10h]
00FD10C6 movaps xmm1,xmm2
00FD10C9 mov edx,258h
00FD10CE xchg ax,ax
for (size_t col = 0; col < COL_SIZE; col++)
00FD10D0 mov ecx,96h
sum += static_cast<RET_TYPE>(a[row * step + col]);
00FD10D5 movups xmm0,xmmword ptr [eax-10h]
00FD10D9 paddq xmm2,xmm0
00FD10DD movups xmm0,xmmword ptr [eax]
00FD10E0 add eax,20h
00FD10E3 paddq xmm1,xmm0
00FD10E7 sub ecx,1
00FD10EA jne sum_array_rows+15h (0FD10D5h)
for (size_t row = 0; row < ROW_SIZE; row++)
00FD10EC sub edx,1
for (size_t row = 0; row < ROW_SIZE; row++)
00FD10EF jne sum_array_rows+10h (0FD10D0h)
return sum;
}
00FD10F1 paddq xmm1,xmm2
00FD10F5 movaps xmm0,xmm1
00FD10F8 psrldq xmm0,8
00FD10FD paddq xmm1,xmm0
00FD1101 movd eax,xmm1
00FD1105 psrldq xmm1,4
00FD110A movd edx,xmm1
00FD110E ret
RET_TYPE sum_array_cols(const ARY_TYPE *a, const size_t step)
{
00FD1110 push ebp
00FD1111 mov ebp,esp
00FD1113 sub esp,24h
00FD1116 push ebx
00FD1117 xorps xmm0,xmm0
RET_TYPE sum = 0;
00FD111A mov dword ptr [ebp-10h],258h
00FD1121 push esi
00FD1122 movlpd qword ptr [ebp-18h],xmm0
00FD1127 lea eax,[ecx+2580h]
00FD112D mov edx,dword ptr [ebp-14h]
00FD1130 push edi
00FD1131 mov edi,dword ptr [sum]
00FD1134 mov dword ptr [ebp-0Ch],eax
00FD1137 nop word ptr [eax+eax]
for (size_t row = 0; row < ROW_SIZE; row++)
00FD1140 xorps xmm0,xmm0
00FD1143 mov dword ptr [ebp-8],0C8h
00FD114A movlpd qword ptr [sum],xmm0
00FD114F mov ecx,dword ptr [ebp-18h]
00FD1152 mov ebx,dword ptr [ebp-14h]
00FD1155 movlpd qword ptr [ebp-20h],xmm0
00FD115A mov esi,dword ptr [ebp-20h]
00FD115D mov dword ptr [ebp-4],ecx
00FD1160 mov ecx,dword ptr [ebp-1Ch]
00FD1163 nop dword ptr [eax]
00FD1167 nop word ptr [eax+eax]
sum += static_cast<RET_TYPE>(a[row * step + col]);
00FD1170 add edi,dword ptr [eax-2580h]
00FD1176 mov dword ptr [sum],edi
00FD1179 adc edx,dword ptr [eax-257Ch]
00FD117F mov edi,dword ptr [ebp-4]
00FD1182 add edi,dword ptr [eax-12C0h]
00FD1188 mov dword ptr [ebp-4],edi
00FD118B adc ebx,dword ptr [eax-12BCh]
00FD1191 add esi,dword ptr [eax]
00FD1193 mov edi,dword ptr [sum]
00FD1196 adc ecx,dword ptr [eax+4]
00FD1199 lea eax,[eax+3840h]
00FD119F sub dword ptr [ebp-8],1
00FD11A3 jne sum_array_cols+60h (0FD1170h)
for (size_t col = 0; col < COL_SIZE; col++)
00FD11A5 add esi,dword ptr [ebp-4]
00FD11A8 mov eax,dword ptr [ebp-0Ch]
00FD11AB adc ecx,ebx
00FD11AD add edi,esi
00FD11AF adc edx,ecx
00FD11B1 add eax,8
00FD11B4 sub dword ptr [ebp-10h],1
00FD11B8 mov dword ptr [ebp-0Ch],eax
00FD11BB jne sum_array_cols+30h (0FD1140h)
return sum;
00FD11BD mov eax,edi
}
00FD11BF pop edi
00FD11C0 pop esi
00FD11C1 pop ebx
00FD11C2 mov esp,ebp
00FD11C4 pop ebp
00FD11C5 ret
================
Optimisation ... /O2
SAME_TYPE ...... 0
RET_TYPE sum_array_rows(const ARY_TYPE *a, const size_t step)
{
00A110C0 push ebp
00A110C1 mov ebp,esp
00A110C3 sub esp,24h
00A110C6 push ebx
00A110C7 xorps xmm0,xmm0
RET_TYPE sum = 0;
00A110CA mov dword ptr [ebp-0Ch],258h
00A110D1 movlpd qword ptr [ebp-18h],xmm0
00A110D6 mov edx,dword ptr [ebp-14h]
00A110D9 mov eax,dword ptr [sum]
00A110DC push esi
00A110DD push edi
00A110DE xchg ax,ax
for (size_t col = 0; col < COL_SIZE; col++)
00A110E0 xorps xmm0,xmm0
00A110E3 mov dword ptr [ebp-8],0C8h
00A110EA movlpd qword ptr [sum],xmm0
00A110EF mov esi,dword ptr [ebp-18h]
00A110F2 mov ebx,dword ptr [ebp-14h]
for (size_t col = 0; col < COL_SIZE; col++)
00A110F5 movlpd qword ptr [ebp-20h],xmm0
00A110FA mov edi,dword ptr [ebp-20h]
00A110FD mov dword ptr [ebp-4],esi
00A11100 mov esi,dword ptr [ebp-1Ch]
sum += static_cast<RET_TYPE>(a[row * step + col]);
00A11103 add eax,dword ptr [ecx]
00A11105 adc edx,0
00A11108 mov dword ptr [sum],edx
00A1110B mov edx,dword ptr [ebp-4]
00A1110E add edx,dword ptr [ecx+4]
00A11111 mov dword ptr [ebp-4],edx
00A11114 mov edx,dword ptr [sum]
00A11117 adc ebx,0
00A1111A add edi,dword ptr [ecx+8]
00A1111D adc esi,0
00A11120 add ecx,0Ch
00A11123 sub dword ptr [ebp-8],1
00A11127 jne sum_array_rows+43h (0A11103h)
for (size_t row = 0; row < ROW_SIZE; row++)
00A11129 add edi,dword ptr [ebp-4]
00A1112C adc esi,ebx
00A1112E add eax,edi
00A11130 adc edx,esi
00A11132 sub dword ptr [ebp-0Ch],1
00A11136 jne sum_array_rows+20h (0A110E0h)
return sum;
}
00A11138 pop edi
00A11139 pop esi
00A1113A pop ebx
00A1113B mov esp,ebp
00A1113D pop ebp
00A1113E ret
RET_TYPE sum_array_cols(const ARY_TYPE *a, const size_t step)
{
00A11140 push ebp
00A11141 mov ebp,esp
00A11143 sub esp,24h
00A11146 push ebx
00A11147 xorps xmm0,xmm0
RET_TYPE sum = 0;
00A1114A mov dword ptr [ebp-10h],258h
00A11151 push esi
00A11152 movlpd qword ptr [ebp-18h],xmm0
00A11157 lea eax,[ecx+12C0h]
00A1115D mov edx,dword ptr [ebp-14h]
00A11160 push edi
00A11161 mov edi,dword ptr [sum]
00A11164 mov dword ptr [ebp-0Ch],eax
00A11167 nop word ptr [eax+eax]
for (size_t row = 0; row < ROW_SIZE; row++)
00A11170 xorps xmm0,xmm0
00A11173 mov dword ptr [ebp-8],0C8h
00A1117A movlpd qword ptr [sum],xmm0
00A1117F mov ecx,dword ptr [ebp-18h]
00A11182 mov ebx,dword ptr [ebp-14h]
00A11185 movlpd qword ptr [ebp-20h],xmm0
00A1118A mov esi,dword ptr [ebp-20h]
00A1118D mov dword ptr [ebp-4],ecx
00A11190 mov ecx,dword ptr [ebp-1Ch]
00A11193 nop dword ptr [eax]
00A11197 nop word ptr [eax+eax]
sum += static_cast<RET_TYPE>(a[row * step + col]);
00A111A0 add edi,dword ptr [eax-12C0h]
00A111A6 lea eax,[eax+1C20h]
00A111AC adc edx,0
00A111AF mov dword ptr [sum],edx
00A111B2 mov edx,dword ptr [ebp-4]
00A111B5 add edx,dword ptr [eax-2580h]
00A111BB mov dword ptr [ebp-4],edx
00A111BE mov edx,dword ptr [sum]
00A111C1 adc ebx,0
00A111C4 add esi,dword ptr [eax-1C20h]
00A111CA adc ecx,0
00A111CD sub dword ptr [ebp-8],1
00A111D1 jne sum_array_cols+60h (0A111A0h)
for (size_t col = 0; col < COL_SIZE; col++)
00A111D3 add esi,dword ptr [ebp-4]
00A111D6 mov eax,dword ptr [ebp-0Ch]
00A111D9 adc ecx,ebx
00A111DB add edi,esi
00A111DD adc edx,ecx
00A111DF add eax,4
00A111E2 sub dword ptr [ebp-10h],1
00A111E6 mov dword ptr [ebp-0Ch],eax
00A111E9 jne sum_array_cols+30h (0A11170h)
return sum;
00A111EB mov eax,edi
}
00A111ED pop edi
00A111EE pop esi
00A111EF pop ebx
00A111F0 mov esp,ebp
00A111F2 pop ebp
00A111F3 ret
The short answer is that the optimizer is trying to be clever but its heuristics are failing in the case with disparate type sizes, which ends up slowing down the code.
Let's begin by reviewing the code generated for the inner loop of sum_array_rows with 64-bit source data.
innerloop:
movups xmm0,[eax-0x10]
paddq xmm2,xmm0
movups xmm0,[eax]
add eax,0x20
paddq xmm1,xmm0
sub ecx,1
jne innerloop
This is roughly equivalent to the following C code using intrinsics:
do {
    sum1 = _mm_add_epi64(sum1, _mm_loadu_si128(&ptr[0]));
    sum2 = _mm_add_epi64(sum2, _mm_loadu_si128(&ptr[1]));
    ptr += 2;
} while(--count);
What we see here is that the optimizer has determined that addition is associative and consequently unrolled the loop onto four parallel accumulators (two SSE registers holding two 64-bit lanes each), which eventually get summed together at the end of the loop. This permits the independent computations to be executed in parallel by the CPU and, more importantly, permits vectorization: the SSE2 instruction set adds pairs of 64-bit integers in a single instruction.
This is a good result.
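To make the idea concrete, the same transformation can be sketched in plain C++ (ptr and count are illustrative names for the data pointer and element count, not taken from the compiler output, and count is assumed to be a multiple of four):
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (size_t i = 0; i < count; i += 4) {
    s0 += ptr[i + 0]; // four independent dependency chains,
    s1 += ptr[i + 1]; // so the CPU can overlap the additions
    s2 += ptr[i + 2];
    s3 += ptr[i + 3];
}
uint64_t sum = (s0 + s1) + (s2 + s3); // fold the accumulators once at the end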
At the other end of the spectrum we have the version with 64-bit accumulation of 32-bit source data:
innerloop:
add eax,[ecx]
adc edx,0
mov [sum],edx
mov edx,[ebp-4]
add edx,[ecx+4]
mov [ebp-4],edx
mov edx,[sum]
adc ebx,0
add edi,[ecx+8]
adc esi,0
add ecx,12
sub [ebp-8],1
jne innerloop
Note that the 32-bit x86 target lacks native support for 64-bit arithmetic in the normal (non-vector) instruction set. Consequently each accumulator is split into separate upper and lower words, with carries from the lower word manually propagated into the upper word.
Furthermore the loop is unrolled three times instead of four.
The pseudo-C version reads roughly like this:
do {
    sum1_lo += ptr[0]; if(carry) ++sum1_hi;
    sum2_lo += ptr[1]; if(carry) ++sum2_hi;
    sum3_lo += ptr[2]; if(carry) ++sum3_hi;
    ptr += 3;
} while(--count);
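For reference, the same carry propagation can be written in portable C++ (again just a sketch with illustrative names, not the compiler's actual output): an unsigned addition wrapped around precisely when the result is smaller than the addend.
uint32_t lo = 0, hi = 0;
for (size_t i = 0; i < count; i++) {
    lo += ptr[i];          // 32-bit add into the low word
    if (lo < ptr[i]) ++hi; // wraparound means a carry into the high word
}
uint64_t sum = (uint64_t(hi) << 32) | lo;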
Regrettably the unrolling is a pessimization here, as the processor has run out of registers and is forced to shunt sum1_lo/sum2_lo and count into memory. An unrolling factor of two would have been the correct choice, and even no unrolling at all would have been faster.
A vectorized version using parallel addition on native 64-bit integers would still have been possible. However, this requires the 32-bit source data to be unpacked (zero-extended) to 64 bits first. Something along these lines:
__m128i zero = _mm_setzero_si128();
do {
    __m128i data = _mm_loadu_si128(ptr++);
    sum1 = _mm_add_epi64(sum1, _mm_unpacklo_epi32(data, zero));
    sum2 = _mm_add_epi64(sum2, _mm_unpackhi_epi32(data, zero));
} while(--count);
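The loop above leaves the total spread across the four 64-bit lanes of sum1 and sum2; a possible reduction tail (again only a sketch) is:
__m128i total = _mm_add_epi64(sum1, sum2); // fold the two vector accumulators
uint64_t lanes[2];
_mm_storeu_si128(reinterpret_cast<__m128i *>(lanes), total);
uint64_t sum = lanes[0] + lanes[1];        // add the two remaining lanes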
I have omitted the code, but the folding of intermediate results is also needlessly performed within the outer loop rather than being deferred to the end of the function. Better yet would be a single combined loop over the raw COL_SIZE*ROW_SIZE array, as sketched below.
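Since the benchmark always passes step == COL_SIZE, the rows are contiguous in memory and the whole sum reduces to one flat pass. A sketch using the question's own macros (sum_array_flat is a hypothetical name):
RET_TYPE sum_array_flat(const ARY_TYPE *a)
{
    RET_TYPE sum = 0;
    for (size_t i = 0; i < ROW_SIZE * COL_SIZE; i++)
        sum += static_cast<RET_TYPE>(a[i]);
    return sum;
}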
So what is going wrong here? Well, modern optimizers are complex beasts full of heuristics, and without insight into their internals we can mostly only speculate.
However, a simplified model is that they are structured into passes, starting from a high-level representation and gradually applying transformations into a lower-level form which is hoped to turn out more efficient. Regrettably, when a relatively high-level transform such as unrolling takes place, low-level register allocation may not yet have happened, so the compiler is largely forced to guess at a suitable unrolling factor.
It has also been my experience that emulated wide-integer arithmetic rarely receives the most love and is frequently shunted off to generic fallbacks and poorly integrated with the other transforms.
Furthermore, vectorization in particular is a difficult process and is generally applied only when the code falls into one of the patterns known to the compiler.
In practice the consequence amounts to a last-straw effect: a minor alteration of previously efficient code may fall just beyond the complexity horizon of the optimizer. Instead of going back and forth indefinitely between the passes until an optimal result is reached, the compiler is forced to rely on guesswork for performance reasons and will eventually cop out, at which point the house of cards comes tumbling down.
Therefore, if your project relies on code generation, it is up to you to perform due diligence by periodically reviewing the output and testing for regressions; and for critical inner loops it is generally advisable to explicitly write something closer to the expected final result rather than relying on fallible transformations.