Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Custom allocator performance

I'm building an AVL tree class which will have a fixed maximum number of items. So I thought instead of allocating each item by itself, I'd just allocate the entire chunk at once and use a bitmap to assign new memory when needed.

My allocation / deallocation code:

avltree::avltree(UINT64 numitems)
{
  root = NULL;

  if (!numitems)
    buffer = NULL;
  else {
    UINT64 memsize = sizeof(avlnode) * numitems + bitlist::storagesize(numitems);
    buffer = (avlnode *) malloc(memsize);
    memmap.init(numitems, buffer + numitems);
    memmap.clear_all();
    freeaddr = 0;
  }
}

avlnode *avltree::newnode(keytype key)
{
  if (!buffer)
    return new avlnode(key);
  else 
  {
    UINT64 pos;
    if (freeaddr < memmap.size_bits)
      pos = freeaddr++;
    else
      pos = memmap.get_first_unset();
    memmap.set_bit(pos);
    return new (&buffer[pos]) avlnode(key);
  }
}

void avltree::deletenode(avlnode *node)
{
  if (!buffer)
    delete node;
  else
    memmap.clear_bit(node - buffer);
}

In order to use standard new / delete, I have to construct the tree with numitems == 0. In order to use my own allocator, I just pass number of items. All functions are inlined for maximum performance.

This is all fine and dandy, but my own allocator is abut 20% slower than new / delete. Now, I know how complex memory allocators are, there's no way that code can run faster than an array lookup + one bit set, but that is exactly the case here. What's worse: my deallocator is slower even if I remove all code from it?!?

When I check assembly output, my allocator's code path is ridden with QWORD PTR instructions dealing with bitmap, avltree or avlnode. It doesn't seem to be much different for the new / delete path.

For example, assembly output of avltree::newnode:

;avltree::newnode, COMDAT
mov     QWORD PTR [rsp+8], rbx
push    rdi
sub     rsp, 32

;if (!buffer)
cmp     QWORD PTR [rcx+8], 0
mov     edi, edx
mov     rbx, rcx
jne     SHORT $LN4@newnode

;  return new avlnode(key);
mov     ecx, 24
call    ??2@YAPEAX_K@Z         ; operator new
jmp     SHORT $LN27@newnode

;$LN4@newnode:
;else {
;  UINT64 pos;
;  if (freeaddr < memmap.size_bits)
mov     r9, QWORD PTR [rcx+40]
cmp     r9, QWORD PTR [rcx+32]
jae     SHORT $LN2@newnode

;    pos = freeaddr++;
lea     rax, QWORD PTR [r9+1]
mov     QWORD PTR [rcx+40], rax

;  else
jmp     SHORT $LN1@newnode
$LN2@newnode:

;    pos = memmap.get_first_unset();
add     rcx, 16
call    ?get_first_unset@bitlist@@QEAA_KXZ ; bitlist::get_first_unset
mov     r9, rax

$LN1@newnode:
; memmap.set_bit(pos);
mov     rcx, QWORD PTR [rbx+16]                    ;data[bindex(pos)] |= bmask(pos);
mov     rdx, r9                                    ;return pos / (sizeof(BITINT) * 8);
shr     rdx, 6
lea     r8, QWORD PTR [rcx+rdx*8]                  ;data[bindex(pos)] |= bmask(pos);
movzx   ecx, r9b                                   ;return 1ull << (pos % (sizeof(BITINT) * 8));
mov     edx, 1
and     cl, 63
shl     rdx, cl

;   return new (&buffer[pos]) avlnode(key);
lea     rcx, QWORD PTR [r9+r9*2]
; File c:\projects\vvd\vvd\util\bitlist.h
or      QWORD PTR [r8], rdx                        ;data[bindex(pos)] |= bmask(pos)

; 195  :     return new (&buffer[pos]) avlnode(key);
mov     rax, QWORD PTR [rbx+8]
lea     rax, QWORD PTR [rax+rcx*8]
; $LN27@newnode:
test    rax, rax
je      SHORT $LN9@newnode

; avlnode constructor; 
mov     BYTE PTR [rax+4], 1
mov     QWORD PTR [rax+8], 0
mov     QWORD PTR [rax+16], 0
mov     DWORD PTR [rax], edi

; 196  :   }
; 197  : }
; $LN9@newnode:
mov     rbx, QWORD PTR [rsp+48]
add     rsp, 32                        ; 00000020H
pop     rdi
ret     0
?newnode@avltree@@QEAAPEAUavlnode@@H@Z ENDP             ; avltree::newnode
_TEXT   ENDS

I have checked multiple times the output of compilation when I construct my avltree with default / custom allocator and it remains the same in this particular region of code. I have tried removing / replacing all relevant parts to no significant effect.

To be honest, I expected compiler to inline all of this since there are very few variables. I was hoping for everything except the avlnode objects themselves to be placed in registers, but that doesn't seem to be the case.

Yet the speed difference is clearly measurable. I'm not calling 3 seconds per 10 million nodes inserted slow, but I expected my code to be faster, not slower than generic allocator (2.5 seconds). That goes especially for the slower deallocator which is slower even when all code is stripped from it.

Why is it slower?

Edit: Thank you all for excellent thoughts on this. But I would like to stress again that the issue is not so much within my method of allocation as it is in suboptimal way of using the variables: the entire avltree class contains just 4 UINT64 variables, bitlist only has 3.

However, despite that, the compiler doesn't optimise this into registers. It insists on QWORD PTR instructions that are orders of magnitude slower. Is this because I'm using classes? Should I move to C / plain variables? Scratch that. Stupid me. I have all the avltree code in there as well, things can't be in registers.

Also, I am at a total loss why my deallocator would still be slower, even if I remove ALL code from it. Yet QueryPerformanceCounter tells me just that. It's insane to even think that: that same deallocator also gets called for the new / delete code path and it has to delete the node... It doesn't have to do anything for my custom allocator (when I strip the code).

Edit2: I have now completely removed the bitlist and implemented free space tracking through a singly-linked list. The avltree::newnode function is now much more compact (21 instructions for my custom allocator path, 7 of those are QWORD PTR ops dealing with avltree and 4 are used for constructor of avlnode). The end result (time) decreased from ~3 seconds to ~2.95 seconds for 10 million allocations.

Edit3: I also rewrote the entire code such that now everything is handled by the singly linked list. Now the avltree class only has two relevant members: root and first_free. The speed difference remains.

Edit4: Rearranging code and looking at performance figures, these things are what helped the most:

  1. As pointed out by all contributors, having a bitmap in there was just plain bad. Removed in favour of singly-linked free slot list.
  2. Code locality: by adding dependent functions (avl tree handling ones) into a function-local class instead of having them declared globally helped some 15% with code speed (3 secs --> 2.5 secs)
  3. avlnode struct size: just adding #pragma pack(1) before struct declaration decreased execution time a further 20% (2,5 secs --> 2 secs)

Edit 5:

Since this querstion seems to have been quite popular, I have posted the final complete code as an answer below. I'm quite satisfied with its performance.

like image 400
velis Avatar asked Apr 15 '15 12:04

velis


People also ask

What is custom allocator?

A custom allocator is a reasonable way to securely erase memory before it is deallocated.

What does an allocator actually need to do?

The purpose of the allocator is to 2allocate raw memory without construction of objects, as well as simply deallocate memory without the need to destroy them, hence the usage of operator new and operator delete directly is preferred over the usage of the keywords new and delete .

What is a general purpose allocator?

Allocators handle all the requests for allocation and deallocation of memory for a given container. The C++ Standard Library provides general-purpose allocators that are used by default, however, custom allocators may also be supplied by the programmer.

What is use of allocator in C++?

Allocators are used by the C++ Standard Library to handle the allocation and deallocation of elements stored in containers. All C++ Standard Library containers except std::array have a template parameter of type allocator<Type> , where Type represents the type of the container element.


1 Answers

Your method only allocates the raw memory in one chunk and then has to do a placement new for each element. Combine that with all the overhead in your bitmap and its not too surprising that the default new allocation beats yours assuming an empty heap.

To get the most speed improvement when allocating you can allocate the entire object in one large array and then assign to it from there. If you look at a very simple and contrived benchmark:

struct test_t {
    float f;
    int i;
    test_t* pNext;
};

const size_t NUM_ALLOCS = 50000000;

void TestNew (void)
{
    test_t* pPtr = new test_t;

    for (int i = 0; i < NUM_ALLOCS; ++i)
    {
        pPtr->pNext = new test_t;
        pPtr = pPtr->pNext;
    }

}

void TestBucket (void)
{
    test_t* pBuckets = new test_t[NUM_ALLOCS + 2];
    test_t* pPtr = pBuckets++;

    for (int i = 0; i < NUM_ALLOCS; ++i)
    {
        pPtr->pNext = pBuckets++;
        pPtr = pPtr->pNext;
    }

}

With this code on MSVC++ 2013 with 50M allocations TestBucket() outperforms TestNew() by over a factor of x16 (130 vs 2080 ms). Even if you add a std::bitset<> to track allocations it is still x4 faster (400 ms).

An important thing to remember about new is that the time its takes to allocate an object generally depends on the state of the heap. An empty heap will be able to allocate a bunch of constant sized objects like this relatively fast, which is probably one reason why your code seems slower than new. If you have a program that runs for a while and allocates a large number of differently sized objects the heap can become fragmented and allocating objects can take much (much) longer.

As an example, one program I wrote was loading a 200MB file with millions of records...lots of differently sized allocations. On the first load it took ~15 seconds but if I deleted that file and tried to load it again it took something like x10-x20 longer. This was entirely due to memory allocation and switching to a simple bucket/arena allocator fixed the issue. So, that contrived benchmark I did showing a x16 speedup might actually get show a significantly larger difference with a fragmented heap.

It gets even more tricky when you realize that different systems/platforms may use different memory allocation schemes so the benchmark results on one system may be different from another.

To distil this into a few short points:

  1. Benchmarking memory allocation is tricky (performance depends on a lot of things)
  2. In some cases you can get better performance with a custom allocator. In a few cases you can get much better.
  3. Creating a custom allocator can be tricky and takes time to profile/benchmark your specific use case.

Note -- Benchmarks like this aren't meant to be realistic but are useful to determine the upper bound of how fast something can be. It can be used along with the profile/benchmark of your actual code to help determine what should/shouldn't be optimized.

Update -- I can't seem to replicate your results in my code under MSVC++ 2013. Using the same structure as your avlnode and trying a placement new yields the same speed as my non-placement bucket allocator tests (placement new was actually a little bit faster). Using a class similar to your avltree doesn't affect the benchmark much. With 10 million allocations/deallocations I'm getting ~800 ms for the new/delete and ~200ms for the custom allocator (both with and without placement new). While I'm not worried about the difference in absolute times, the relative time difference seems odd.

I would suggest taking a closer look at your benchmark and make sure you are measuring what you think you are. If the code exists in a larger code-base then create a minimal test case to benchmark it. Make sure that your compiler optimizer is not doing something that would invalidate the benchmark (it happens too easily these days).

Note that it would be far easier to answer your question if you had reduced it to a minimal example and included the complete code in the question, including the benchmark code. Benchmarking is one of those things that seems easy but there are a lot of "gotchas" involved in it.

Update 2 -- Including the basic allocator class and benchmark code I'm using so others can try to duplicate my results. Note that this is for testing only and is far from actual working/production code. It is far simpler than your code which may be why we're getting different results.

#include <string>
#include <Windows.h>

struct test_t 
{
    __int64 key;
    __int64 weight;
    __int64 left;
    __int64 right;
    test_t* pNext;     // Simple linked list

    test_t() : key(0), weight(0), pNext(NULL), left(0), right(0) { }
    test_t(const __int64 k) : key(k), weight(0), pNext(NULL), left(0), right(0) { }
};

const size_t NUM_ALLOCS = 10000000;
test_t* pLast; //To prevent compiler optimizations from being "smart"

struct CTest 
{
    test_t* m_pBuffer;
    size_t  m_MaxSize;
    size_t  m_FreeIndex;
    test_t* m_pFreeList;

    CTest(const size_t Size) :
            m_pBuffer(NULL),
            m_MaxSize(Size),
            m_pFreeList(NULL),
            m_FreeIndex(0)
    {
        if (m_MaxSize > 0) m_pBuffer = (test_t *) new char[sizeof(test_t) * (m_MaxSize + 1)];
    }

    test_t* NewNode(__int64 key)
    {
        if (!m_pBuffer || m_FreeIndex >= m_MaxSize) return new test_t(key);

        size_t Pos = m_FreeIndex;
        ++m_FreeIndex;
        return new (&m_pBuffer[Pos]) test_t(key);
    }

    void DeleteNode(test_t* pNode)
    {
        if (!m_pBuffer) {
            delete pNode;
        }
        else
        {
            pNode->pNext = m_pFreeList;
            m_pFreeList = pNode;
        }
    }

};


void TestNew(void)
{
    test_t* pPtr = new test_t;
    test_t* pFirst = pPtr;

    for (int i = 0; i < NUM_ALLOCS; ++i)
    {
        pPtr->pNext = new test_t;
        pPtr = pPtr->pNext; 
    }

    pPtr = pFirst;

    while (pPtr)
    {
        test_t* pTemp = pPtr;
        pPtr = pPtr->pNext;
        delete pTemp;
    }

    pLast = pPtr;    
}


void TestClass(const size_t BufferSize)
{
    CTest Alloc(BufferSize);
    test_t* pPtr = Alloc.NewNode(0);
    test_t* pFirstPtr = pPtr;

    for (int i = 0; i < NUM_ALLOCS; ++i)
    {
        pPtr->pNext = Alloc.NewNode(i);
        pPtr = pPtr->pNext;
    }

    pLast = pPtr;
    pPtr = pFirstPtr;

    while (pPtr != NULL)
    {
        test_t* pTmp = pPtr->pNext;
        Alloc.DeleteNode(pPtr);
        pPtr = pTmp;
    }
}


int main(void)
{
    DWORD StartTick = GetTickCount();
    TestClass(0);
    //TestClass(NUM_ALLOCS + 10);
    //TestNew();
    DWORD EndTick = GetTickCount();

    printf("Time = %u ms\n", EndTick - StartTick);
    printf("Last = %p\n", pLast);

    return 0;
}

Currently I'm getting ~800ms for both TestNew() and TestClass(0) and under 200ms for TestClass(NUM_ALLOCS + 10). The custom allocator is pretty fast as it operates on the memory in a completely linear fashion which allows the memory cache to work its magic. I'm also using GetTickCount() for simplicity and it is accurate enough so long as times are above ~100ms.

like image 84
uesp Avatar answered Oct 05 '22 08:10

uesp