I'm building an AVL tree class which will have a fixed maximum number of items. So I thought instead of allocating each item by itself, I'd just allocate the entire chunk at once and use a bitmap to assign new memory when needed.
My allocation / deallocation code:
avltree::avltree(UINT64 numitems)
{
root = NULL;
if (!numitems)
buffer = NULL;
else {
UINT64 memsize = sizeof(avlnode) * numitems + bitlist::storagesize(numitems);
buffer = (avlnode *) malloc(memsize);
memmap.init(numitems, buffer + numitems);
memmap.clear_all();
freeaddr = 0;
}
}
avlnode *avltree::newnode(keytype key)
{
if (!buffer)
return new avlnode(key);
else
{
UINT64 pos;
if (freeaddr < memmap.size_bits)
pos = freeaddr++;
else
pos = memmap.get_first_unset();
memmap.set_bit(pos);
return new (&buffer[pos]) avlnode(key);
}
}
void avltree::deletenode(avlnode *node)
{
if (!buffer)
delete node;
else
memmap.clear_bit(node - buffer);
}
In order to use standard new / delete, I have to construct the tree with numitems == 0. In order to use my own allocator, I just pass number of items. All functions are inlined for maximum performance.
This is all fine and dandy, but my own allocator is abut 20% slower than new / delete. Now, I know how complex memory allocators are, there's no way that code can run faster than an array lookup + one bit set, but that is exactly the case here. What's worse: my deallocator is slower even if I remove all code from it?!?
When I check assembly output, my allocator's code path is ridden with QWORD PTR instructions dealing with bitmap, avltree or avlnode. It doesn't seem to be much different for the new / delete path.
For example, assembly output of avltree::newnode:
;avltree::newnode, COMDAT
mov QWORD PTR [rsp+8], rbx
push rdi
sub rsp, 32
;if (!buffer)
cmp QWORD PTR [rcx+8], 0
mov edi, edx
mov rbx, rcx
jne SHORT $LN4@newnode
; return new avlnode(key);
mov ecx, 24
call ??2@YAPEAX_K@Z ; operator new
jmp SHORT $LN27@newnode
;$LN4@newnode:
;else {
; UINT64 pos;
; if (freeaddr < memmap.size_bits)
mov r9, QWORD PTR [rcx+40]
cmp r9, QWORD PTR [rcx+32]
jae SHORT $LN2@newnode
; pos = freeaddr++;
lea rax, QWORD PTR [r9+1]
mov QWORD PTR [rcx+40], rax
; else
jmp SHORT $LN1@newnode
$LN2@newnode:
; pos = memmap.get_first_unset();
add rcx, 16
call ?get_first_unset@bitlist@@QEAA_KXZ ; bitlist::get_first_unset
mov r9, rax
$LN1@newnode:
; memmap.set_bit(pos);
mov rcx, QWORD PTR [rbx+16] ;data[bindex(pos)] |= bmask(pos);
mov rdx, r9 ;return pos / (sizeof(BITINT) * 8);
shr rdx, 6
lea r8, QWORD PTR [rcx+rdx*8] ;data[bindex(pos)] |= bmask(pos);
movzx ecx, r9b ;return 1ull << (pos % (sizeof(BITINT) * 8));
mov edx, 1
and cl, 63
shl rdx, cl
; return new (&buffer[pos]) avlnode(key);
lea rcx, QWORD PTR [r9+r9*2]
; File c:\projects\vvd\vvd\util\bitlist.h
or QWORD PTR [r8], rdx ;data[bindex(pos)] |= bmask(pos)
; 195 : return new (&buffer[pos]) avlnode(key);
mov rax, QWORD PTR [rbx+8]
lea rax, QWORD PTR [rax+rcx*8]
; $LN27@newnode:
test rax, rax
je SHORT $LN9@newnode
; avlnode constructor;
mov BYTE PTR [rax+4], 1
mov QWORD PTR [rax+8], 0
mov QWORD PTR [rax+16], 0
mov DWORD PTR [rax], edi
; 196 : }
; 197 : }
; $LN9@newnode:
mov rbx, QWORD PTR [rsp+48]
add rsp, 32 ; 00000020H
pop rdi
ret 0
?newnode@avltree@@QEAAPEAUavlnode@@H@Z ENDP ; avltree::newnode
_TEXT ENDS
I have checked multiple times the output of compilation when I construct my avltree with default / custom allocator and it remains the same in this particular region of code. I have tried removing / replacing all relevant parts to no significant effect.
To be honest, I expected compiler to inline all of this since there are very few variables. I was hoping for everything except the avlnode objects themselves to be placed in registers, but that doesn't seem to be the case.
Yet the speed difference is clearly measurable. I'm not calling 3 seconds per 10 million nodes inserted slow, but I expected my code to be faster, not slower than generic allocator (2.5 seconds). That goes especially for the slower deallocator which is slower even when all code is stripped from it.
Why is it slower?
Edit: Thank you all for excellent thoughts on this. But I would like to stress again that the issue is not so much within my method of allocation as it is in suboptimal way of using the variables: the entire avltree class contains just 4 UINT64 variables, bitlist only has 3.
However, despite that, the compiler doesn't optimise this into registers. It insists on QWORD PTR instructions that are orders of magnitude slower. Is this because I'm using classes? Should I move to C / plain variables? Scratch that. Stupid me. I have all the avltree code in there as well, things can't be in registers.
Also, I am at a total loss why my deallocator would still be slower, even if I remove ALL code from it. Yet QueryPerformanceCounter tells me just that. It's insane to even think that: that same deallocator also gets called for the new / delete code path and it has to delete the node... It doesn't have to do anything for my custom allocator (when I strip the code).
Edit2: I have now completely removed the bitlist and implemented free space tracking through a singly-linked list. The avltree::newnode function is now much more compact (21 instructions for my custom allocator path, 7 of those are QWORD PTR ops dealing with avltree and 4 are used for constructor of avlnode). The end result (time) decreased from ~3 seconds to ~2.95 seconds for 10 million allocations.
Edit3: I also rewrote the entire code such that now everything is handled by the singly linked list. Now the avltree class only has two relevant members: root and first_free. The speed difference remains.
Edit4: Rearranging code and looking at performance figures, these things are what helped the most:
#pragma pack(1)
before struct declaration decreased execution time a further 20% (2,5 secs --> 2 secs)Edit 5:
Since this querstion seems to have been quite popular, I have posted the final complete code as an answer below. I'm quite satisfied with its performance.
A custom allocator is a reasonable way to securely erase memory before it is deallocated.
The purpose of the allocator is to 2allocate raw memory without construction of objects, as well as simply deallocate memory without the need to destroy them, hence the usage of operator new and operator delete directly is preferred over the usage of the keywords new and delete .
Allocators handle all the requests for allocation and deallocation of memory for a given container. The C++ Standard Library provides general-purpose allocators that are used by default, however, custom allocators may also be supplied by the programmer.
Allocators are used by the C++ Standard Library to handle the allocation and deallocation of elements stored in containers. All C++ Standard Library containers except std::array have a template parameter of type allocator<Type> , where Type represents the type of the container element.
Your method only allocates the raw memory in one chunk and then has to do a placement new for each element. Combine that with all the overhead in your bitmap and its not too surprising that the default new
allocation beats yours assuming an empty heap.
To get the most speed improvement when allocating you can allocate the entire object in one large array and then assign to it from there. If you look at a very simple and contrived benchmark:
struct test_t {
float f;
int i;
test_t* pNext;
};
const size_t NUM_ALLOCS = 50000000;
void TestNew (void)
{
test_t* pPtr = new test_t;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = new test_t;
pPtr = pPtr->pNext;
}
}
void TestBucket (void)
{
test_t* pBuckets = new test_t[NUM_ALLOCS + 2];
test_t* pPtr = pBuckets++;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = pBuckets++;
pPtr = pPtr->pNext;
}
}
With this code on MSVC++ 2013 with 50M allocations TestBucket()
outperforms TestNew()
by over a factor of x16 (130 vs 2080 ms). Even if you add a std::bitset<>
to track allocations it is still x4 faster (400 ms).
An important thing to remember about new
is that the time its takes to allocate an object generally depends on the state of the heap. An empty heap will be able to allocate a bunch of constant sized objects like this relatively fast, which is probably one reason why your code seems slower than new
. If you have a program that runs for a while and allocates a large number of differently sized objects the heap can become fragmented and allocating objects can take much (much) longer.
As an example, one program I wrote was loading a 200MB file with millions of records...lots of differently sized allocations. On the first load it took ~15 seconds but if I deleted that file and tried to load it again it took something like x10-x20 longer. This was entirely due to memory allocation and switching to a simple bucket/arena allocator fixed the issue. So, that contrived benchmark I did showing a x16 speedup might actually get show a significantly larger difference with a fragmented heap.
It gets even more tricky when you realize that different systems/platforms may use different memory allocation schemes so the benchmark results on one system may be different from another.
To distil this into a few short points:
Note -- Benchmarks like this aren't meant to be realistic but are useful to determine the upper bound of how fast something can be. It can be used along with the profile/benchmark of your actual code to help determine what should/shouldn't be optimized.
Update -- I can't seem to replicate your results in my code under MSVC++ 2013. Using the same structure as your avlnode
and trying a placement new
yields the same speed as my non-placement bucket allocator tests (placement new was actually a little bit faster). Using a class similar to your avltree
doesn't affect the benchmark much. With 10 million allocations/deallocations I'm getting ~800 ms for the new
/delete
and ~200ms for the custom allocator (both with and without placement new
). While I'm not worried about the difference in absolute times, the relative time difference seems odd.
I would suggest taking a closer look at your benchmark and make sure you are measuring what you think you are. If the code exists in a larger code-base then create a minimal test case to benchmark it. Make sure that your compiler optimizer is not doing something that would invalidate the benchmark (it happens too easily these days).
Note that it would be far easier to answer your question if you had reduced it to a minimal example and included the complete code in the question, including the benchmark code. Benchmarking is one of those things that seems easy but there are a lot of "gotchas" involved in it.
Update 2 -- Including the basic allocator class and benchmark code I'm using so others can try to duplicate my results. Note that this is for testing only and is far from actual working/production code. It is far simpler than your code which may be why we're getting different results.
#include <string>
#include <Windows.h>
struct test_t
{
__int64 key;
__int64 weight;
__int64 left;
__int64 right;
test_t* pNext; // Simple linked list
test_t() : key(0), weight(0), pNext(NULL), left(0), right(0) { }
test_t(const __int64 k) : key(k), weight(0), pNext(NULL), left(0), right(0) { }
};
const size_t NUM_ALLOCS = 10000000;
test_t* pLast; //To prevent compiler optimizations from being "smart"
struct CTest
{
test_t* m_pBuffer;
size_t m_MaxSize;
size_t m_FreeIndex;
test_t* m_pFreeList;
CTest(const size_t Size) :
m_pBuffer(NULL),
m_MaxSize(Size),
m_pFreeList(NULL),
m_FreeIndex(0)
{
if (m_MaxSize > 0) m_pBuffer = (test_t *) new char[sizeof(test_t) * (m_MaxSize + 1)];
}
test_t* NewNode(__int64 key)
{
if (!m_pBuffer || m_FreeIndex >= m_MaxSize) return new test_t(key);
size_t Pos = m_FreeIndex;
++m_FreeIndex;
return new (&m_pBuffer[Pos]) test_t(key);
}
void DeleteNode(test_t* pNode)
{
if (!m_pBuffer) {
delete pNode;
}
else
{
pNode->pNext = m_pFreeList;
m_pFreeList = pNode;
}
}
};
void TestNew(void)
{
test_t* pPtr = new test_t;
test_t* pFirst = pPtr;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = new test_t;
pPtr = pPtr->pNext;
}
pPtr = pFirst;
while (pPtr)
{
test_t* pTemp = pPtr;
pPtr = pPtr->pNext;
delete pTemp;
}
pLast = pPtr;
}
void TestClass(const size_t BufferSize)
{
CTest Alloc(BufferSize);
test_t* pPtr = Alloc.NewNode(0);
test_t* pFirstPtr = pPtr;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = Alloc.NewNode(i);
pPtr = pPtr->pNext;
}
pLast = pPtr;
pPtr = pFirstPtr;
while (pPtr != NULL)
{
test_t* pTmp = pPtr->pNext;
Alloc.DeleteNode(pPtr);
pPtr = pTmp;
}
}
int main(void)
{
DWORD StartTick = GetTickCount();
TestClass(0);
//TestClass(NUM_ALLOCS + 10);
//TestNew();
DWORD EndTick = GetTickCount();
printf("Time = %u ms\n", EndTick - StartTick);
printf("Last = %p\n", pLast);
return 0;
}
Currently I'm getting ~800ms for both TestNew()
and TestClass(0)
and under 200ms for TestClass(NUM_ALLOCS + 10)
. The custom allocator is pretty fast as it operates on the memory in a completely linear fashion which allows the memory cache to work its magic. I'm also using GetTickCount()
for simplicity and it is accurate enough so long as times are above ~100ms.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With