Evaluating sizeof causes improvement in performance C

Question

I have a dynamic array and there is a hot loop that spends a lot of time adding a lot of elements to the dynamic array I have implemented.

The dynamic array works by storing a header at the start of the array of values:

typedef struct
{
    unsigned char* begin; // Pointer to first element (should just point to end of struct, the header is stored prior data)
    unsigned char* end; // Pointer to last element
    size_t typeSize; // Size of type that the array stores
    size_t capacity; // capacity of array (when size ((end - begin) / typeSize) exceeds capacity, realloc)
} dynamicArr;

For some reason the loop runs faster when I compare the sizeof item to the typeSize, and its by a large margin (30% increase) as to when I don't make the comparison.

Note, the comparison is only there to safe keep against adding an item of different type and causing misalignment in the array. This should not happen anyway if the dynamic array is used properly to store 1 type and thus in practice should always evaluate to true.

Here's the code for adding an element to the list:

if (arr)
{
    dynamicArr* header = dynamicArr_Header(arr);
    if (header->typeSize && header->typeSize == sizeof item) //If I remove "header->typeSize == sizeof item" performance decreases.
    {
        size_t arrSize = dynamicArr_Size(header);
        if (arrSize == header->capacity)
        {
            size_t newCapacity = (size_t)(header->capacity * 1.5f);
            if (newCapacity == header->capacity) ++newCapacity;
            void* tmp = realloc(header, sizeof(dynamicArr) + header->typeSize * newCapacity);
            if (tmp)
            {
                dynamicArr_Construct(header, tmp, newCapacity, arrSize, header->typeSize);
                *((void**)&(arr)) = header->begin;
                arr[arrSize] = item;
                header->end += header->typeSize;
            }
            else 
            { 
                free(header); 
                arr = NULL; 
            }
        }
        else
        {
            arr[arrSize] = item;
            header->end += header->typeSize;
        }
    }
}

I don't understand why this was the case. I'm not very good at reading assembly either, from what I can see though they are very different, so if someone could help me out here that would be much appreciated!

(Compiled in MSVC with /O2 and /Tc)

Link to the assembly and the rest of the relevant code.

Edit 1: A lot of people seem to think that the reason is because sizeof item is simply evaluated at compile time. I don't think that this is the case because if I remove the condition and replace all instances of header->typeSize with sizeof item the performance is still worse than if the if condition was there. => I seem to have missed changing the use of header->typeSize in the macro dynamicArr_Size which caused this confusion, refer to the marked answer.

Here is the full code:

typedef struct
{
    unsigned char* begin; // Pointer to data
    unsigned char* end; // Pointer to last element
    size_t typeSize; // Size of type
    size_t capacity; // Capacity of array (not number of elements in array)
} dynamicArr;

#define dynamicArr_ConstructBase(dest, src, newCapacity) dest = src; dest->capacity = newCapacity; dest->begin = (unsigned char*)dest + sizeof *dest
#define dynamicArr_Construct(dest, src, newCapacity, currSize, typeSize) dynamicArr_ConstructBase(dest, src, newCapacity); dest->end = dest->begin + typeSize * (currSize)

#define dynamicArr_Header(arr) ((dynamicArr*)((unsigned char*)(arr) - sizeof(dynamicArr)))
static inline size_t dynamicArr_Size(dynamicArr* arr)
{
    return (arr->end - arr->begin) / arr->typeSize;
}

#define dynamicArr_Create(typename, arr) typename* arr = (typename*)dynamicArr_Create_(sizeof(typename))
static inline unsigned char* dynamicArr_Create_(size_t typeSize)
{
    dynamicArr* dynArr;
    void* tmp = malloc(sizeof * dynArr + typeSize * 10);
    if (!tmp) return NULL;

    dynArr = tmp;
    dynArr->begin = (unsigned char*)dynArr + sizeof * dynArr;
    dynArr->end = dynArr->begin;
    dynArr->capacity = 10;
    dynArr->typeSize = typeSize;

    return dynArr->begin;
}

#define dynamicArr_Free(arr) free(dynamicArr_Header(arr))

#define dynamicArr_Push(arr, item) \
do {\
if (arr) \
{ \
    dynamicArr* header = dynamicArr_Header(arr); \
    if (header->typeSize && header->typeSize == sizeof item) \
    { \
        size_t arrSize = dynamicArr_Size(header); \
        if (arrSize == header->capacity) \
        { \
            size_t newCapacity = (size_t)(header->capacity * 1.5f); \
            if (newCapacity == header->capacity) ++newCapacity; \
            void* tmp = realloc(header, sizeof(dynamicArr) + header->typeSize * newCapacity); \
            if (tmp) \
            { \
                dynamicArr_Construct(header, tmp, newCapacity, arrSize, header->typeSize); \
                *((void**)&(arr)) = header->begin; \
                arr[arrSize] = item; \
                header->end += header->typeSize; \
            } \
            else  \
            {  \
                free(header);  \
                arr = NULL;  \
            } \
        } \
        else \
        { \
            arr[arrSize] = item; \
            header->end += header->typeSize; \
        } \
    } \
} \
} while(0)

And example use:

void Func()
{
    dynamicArr_Create(int, intArr);
    dynamicArr_Push(intArr, 10);
    printf("%i
", intArr[0]);
    dynamicArr_Free(intArr);
}

As for a simple test for profiling:

int main()
{
    dynamicArr_Create(int, intArr);

    clock_t begin = clock();

    for (int i = 0; i < 1000000000; ++i)
    {
        dynamicArr_Push(intArr, 10);
    }

    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f
", time_spent);

    dynamicArr_Free(intArr);
}

Compiling in release mode in Visual Studio 2019 on windows using /Tc I get the results:

With header->typeSize == sizeof item => 3.65 seconds
Without header->typeSize == sizeof item => 9.374 seconds
Replacing header->typeSize with sizeof item and removing header->typeSize == sizeof item => 9.302 seconds

I repeated the test 10 times and it was consistent to the results above.

user3840170 · Accepted Answer

This is pretty straightforward constant propagation, which enables the compiler to use more efficient instructions.

Since the value of sizeof is a constant known at compile time, in the version with the comparison the compiler knows the value of typeSize and therefore can use more efficient instructions. Here’s for example the computation of dynamicArr_Size(header), which is to say, (header->end - header->begin) / header->typeSize:

--- without sizeof
+++ with sizeof
-        mov     rax, QWORD PTR [rbx+8]
-        xor     edx, edx
-        sub     rax, QWORD PTR [rbx]
-        div     r8
+        mov     rdi, QWORD PTR [rbx+8]
+        sub     rdi, QWORD PTR [rbx]
+        shr     rdi, 2

In the version without sizeof, the compiler has to treat the element size as an unknown and use the actual divide instruction (which also requires zeroing the rdx register, as the instruction takes a 128-bit dividend in the rdx:rax register pair). When the size is known, the compiler can substitute a much faster bit shift instead and avoid touching rdx. This is probably the most impactful difference, as the divide instruction tends to be incredibly slow relative to other arithmetic; most compilers, whenever they have a chance (when the divisor is constant), will instead use bit shifts or wrap-around multiplication tricks just to avoid using the division instruction. (Matt Godbolt has a whole section about division in his talk about compiler optimisations.) More efficient use of registers also frees up registers to be used in other places and may prevent spilling values into memory (though in this case it doesn’t seem to make much of a difference).

Another example, here’s how sizeof(dynamicArr) + header->typeSize * newCapacity is computed:

--- without sizeof
+++ with sizeof
-        mov     rdx, rsi
-        imul    rdx, r8
-        add     rdx, 32
+        lea     rdx, QWORD PTR [rsi*4+32]

In the version without comparison against sizeof, the compiler has to treat header->typeSize as an unknown and use a generic multiplication instruction. When the size is known, it can instead use a lea instruction with a special addressing mode that allows to compute value in a single instruction. Though generic multiplication is not as slow as division, a bit shift will still beat it. But even if lea were not necessarily faster in itself, the higher code density allows more code to fit into the instruction cache and avoid slowdowns due to cache misses.

Finally, you claimed that

A lot of people seem to think that the reason is because sizeof item is simply evaluated at compile time. I don't think that this is the case because if I remove the condition and replace all instances of header->typeSize with sizeof item the performance is still worse than if the if condition was there.

I assume you have only replaced direct uses of the field inside the macro and not the indirect use inside the dynamicArr_Size utility function. When you replace that use as well, the generated code is nearly identical, modulo label names and the immediate value in the if condition check. With comparison against sizeof, the compiler simply does that automatically while inlining that function.

Evaluating sizeof causes improvement in performance C

Tags:

c

optimization

compiler-optimization

x86-64

name

Video Answer

1 Answers

user3840170

Recent Activity

Donate For Us

Evaluating sizeof causes improvement in performance C

Tags:

c

optimization

compiler-optimization

x86-64

name

Video Answer

1 Answers

user3840170

Related questions

Recent Activity

Donate For Us