I am curious to know whether allocating memory using the default new operator is a non-blocking operation.
e.g.
struct Node {
    int a, b;
};
...
Node *foo = new Node();
If multiple threads tried to create a new Node and one of them was suspended by the OS in the middle of allocation, would it block the other threads from making progress?
The reason I ask is that I had a concurrent data structure that created new nodes. I then modified the algorithm to recycle the nodes. The throughput performance of the two algorithms was virtually identical on a 24-core machine. However, I then created an interference program that ran on all the system cores in order to create as much OS pre-emption as possible. The throughput of the algorithm that created new nodes decreased by a factor of 5 relative to the algorithm that recycled nodes.
I'm curious to know why this would occur.
Thanks.
Edit: pointing me to the code for the C++ memory allocator for Linux would be helpful as well. I tried looking before posting this question, but had trouble finding it.
It seems to me that if your interference app were using new/delete (malloc/free), then it would interfere with the non-recycling test more. But I don't know how your interference test is implemented.
Depending on how you recycle (i.e., if you use pthread mutexes, God forbid), your recycle code could be slow (GCC atomic ops would be 40x faster at implementing recycle).
Malloc, in some variation, has been aware of threads for a long time on at least some platforms. Use the compiler switches on GCC to be sure you get the thread-aware version. Newer algorithms maintain pools of small memory chunks for each thread, so there is little or no blocking if your thread has the small item available. I have oversimplified this, and it depends on what malloc your system is using. Plus, if you go and allocate millions of items to do a test... well, then you won't see that effect, because the small-item pools are limited in size. Or maybe you will; I don't know. If you freed each item right after allocating it, you would be more likely to see it. Freed small items go back into the small-item lists rather than the shared heap. Although "what happens when thread B frees an item allocated by thread A" is a problem that may or may not be dealt with on your version of malloc, and may not be dealt with in a non-blocking manner. For sure, if you didn't immediately free during a large test, then the thread would have to refill its small-item list many times. That can block if more than one thread tries. Finally, at some point your process's heap will ask the system for heap memory, which can obviously block.
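To make the per-thread pool idea concrete, here is a minimal sketch (illustrative only, not any real allocator's code; it uses C++11 thread_local for brevity and assumes every request is one small fixed size) of a fast thread-private path with a locking fallback:
#include <cstddef>
#include <mutex>
// Hypothetical illustration of per-thread small-item pools. Real
// allocators (ptmalloc arenas, tcmalloc thread caches) are far more
// sophisticated; this only shows why most allocations need no lock.
const size_t kChunk = 64;                         // one small item
static char g_arena[1 << 20];                     // shared heap (stand-in for sbrk)
static size_t g_used = 0;
static std::mutex g_heapLock;                     // protects the shared path
struct FreeChunk { FreeChunk *next; };
static thread_local FreeChunk *t_freeList = nullptr; // per-thread pool
void *AllocSmall() {
    if (t_freeList) {                             // fast path: thread-private, no lock
        FreeChunk *c = t_freeList;
        t_freeList = c->next;
        return c;
    }
    std::lock_guard<std::mutex> lk(g_heapLock);   // slow path: refill can block
    if (g_used + kChunk > sizeof(g_arena)) return nullptr; // arena exhausted
    void *p = g_arena + g_used;
    g_used += kChunk;
    return p;
}
void FreeSmall(void *p) {                         // freed items return to this thread's
    FreeChunk *c = (FreeChunk *)p;                // list, not the shared heap -- so a
    c->next = t_freeList;                         // tight alloc/free loop never blocks
    t_freeList = c;
}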
So, are you using small memory items? For your malloc I don't know what small would be, but if you are under 1K, that is for sure small. Are you allocating and freeing one after the other, or allocating thousands of nodes and then freeing thousands of nodes? Was your interference app allocating? All of these things will affect the results.
How to recycle with atomic ops (CAS = compare and swap):
First, add a pNextFreeNode field to your node object. I used void*; you can use your type. This code is for 32-bit pointers, but works for 64-bit as well.
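For example, the Node from the question would gain the link like this (a sketch; adapt to your own type):
struct Node {
    int a, b;
    void *pNextFreeNode; // links this node into the recycle pile while it is free
};
Then make a global recycle pile: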
void *_pRecycleHead; // global head of recycle list.
Add to recycle pile:
void *Old;
while (1) { // concurrency loop
    Old = _pRecycleHead;             // copy the state of the world. We operate on the copy
    pFreedNode->pNextFreeNode = Old; // chain the new node to the current head of recycled items
    if (CAS(&_pRecycleHead, Old, pFreedNode)) // switch head of recycled items to new node
        break;                       // success
}
Remove from pile:
void *Old;
while ((Old = _pRecycleHead) != NULL) { // concurrency loop; only look for recycled items if the head ain't null
    if (CAS(&_pRecycleHead, Old, ((Node*)Old)->pNextFreeNode)) // switch head to head->next
        break;                       // success
}
pNodeYouCanUseNow = (Node*)Old;
Using CAS means the operation will succeed only if the item you are changing still holds the Old value you pass in. If there is a race and another thread got there first, the old value will be different. In real life this race happens very, very rarely. CAS is only slightly slower than actually setting a value, so compared to mutexes... it rocks.
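The CAS above is pseudocode rather than a standard function. A minimal sketch of how it might map onto GCC's atomic builtins (an assumption on my part -- the answer only says "gcc atomic ops"; __sync_bool_compare_and_swap is available since GCC 4.1):
// Hypothetical wrappers for the CAS used in these snippets. Each returns
// true if *ptr still held oldVal and was atomically replaced with newVal,
// false if another thread won the race.
inline bool CAS(void **ptr, void *oldVal, void *newVal) {
    return __sync_bool_compare_and_swap(ptr, oldVal, newVal);
}
// 64-bit flavour for the versioned union below:
inline bool CAS(unsigned long long *ptr, unsigned long long oldVal, unsigned long long newVal) {
    return __sync_bool_compare_and_swap(ptr, oldVal, newVal);
}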
The remove-from-pile code above has a race condition if you add and remove the same item rapidly (the classic ABA problem). We solve that by adding a version # to the CAS'able data. If you change the version # at the same time as the pointer to the head of the recycle pile, you win. Use a union; it costs nothing extra to CAS 64 bits.
union TRecycle {
    struct {
        int iVersion;
        void *pRecycleHead;
    }; // we can set these. Note, I didn't name this struct; you may have to if you want ANSI
    unsigned long long n64; // we CAS this
};
Note: you will have to go to a 128-bit struct for a 64-bit OS (a sketch follows after the remove code below). So the global recycle pile looks like this now:
TRecycle _RecycleHead;
Add to recycle pile:
while (1) { // concurrency loop
    TRecycle New, Old;
    Old.n64 = _RecycleHead.n64;                   // copy state
    New.n64 = Old.n64;                            // new state starts as a copy
    pFreedNode->pNextFreeNode = Old.pRecycleHead; // link item to be recycled into recycle pile
    New.pRecycleHead = pFreedNode;                // make the new state
    New.iVersion++;                               // adding an item to the list increments the version
    if (CAS(&_RecycleHead.n64, Old.n64, New.n64)) // now if the version changed... we fail
        break;                                    // success
}
Remove from pile:
TRecycle New, Old; // Old must outlive the loop; we hand its head pointer to the caller
while (1) { // concurrency loop
    Old.n64 = _RecycleHead.n64;                   // copy state
    if (Old.pRecycleHead == NULL)                 // pile is empty; the caller gets NULL
        break;
    New.n64 = Old.n64;                            // new state starts as a copy
    New.pRecycleHead = ((Node*)Old.pRecycleHead)->pNextFreeNode; // new state skips over the first item in the recycle list so we can have that item
    New.iVersion++;                               // taking an item off the list increments the version
    if (CAS(&_RecycleHead.n64, Old.n64, New.n64)) // we fail if the version is different
        break;                                    // success
}
pNodeYouCanUseNow = (Node*)Old.pRecycleHead;
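As noted above, on a 64-bit OS the pointer plus version no longer fit in 64 bits, so the whole union has to grow to 128 bits and be CAS'd as one unit. A sketch of what that might look like (an assumption, not from the answer: x86-64 GCC built with -mcx16, where __sync_bool_compare_and_swap on unsigned __int128 can compile down to cmpxchg16b; some GCC versions route it through libatomic instead):
// Hypothetical 128-bit variant for 64-bit pointers. iVersion is widened
// so the struct fills exactly 16 bytes; the alignment is required by
// the cmpxchg16b instruction.
union TRecycle64 {
    struct {
        long long iVersion;
        void *pRecycleHead;
    };
    unsigned __int128 n128; // we CAS this
} __attribute__((aligned(16)));
The add and remove loops are unchanged apart from CAS'ing n128 instead of n64.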
I bet if you recycle this way you will see a perf increase.
In multithreaded systems, malloc() and free() (and new/delete) typically use synchronisation primitives to make them safe to call from multiple threads.
This synchronisation also affects the performance of some applications, particularly applications that do a lot of allocation and deallocation in highly parallel environments. More efficient multithreaded memory allocators are an active field of research - see jemalloc and tcmalloc for two well-known ones.
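The effect is easy to measure with a stress test along these lines (a hypothetical sketch: build with g++ -O2 bench.cpp -lpthread, time it under the default allocator, then again with jemalloc or tcmalloc loaded via LD_PRELOAD):
#include <pthread.h>
#include <stdio.h>
// 24 threads hammering the allocator with small, short-lived objects --
// the pattern where allocator synchronisation dominates.
struct Node { int a, b; };
static void *Hammer(void *) {
    for (int i = 0; i < 1000000; i++) {
        Node *n = new Node(); // exercises the allocator's contended paths
        delete n;
    }
    return NULL;
}
int main() {
    const int kThreads = 24;
    pthread_t t[kThreads];
    for (int i = 0; i < kThreads; i++)
        pthread_create(&t[i], NULL, Hammer, NULL);
    for (int i = 0; i < kThreads; i++)
        pthread_join(t[i], NULL);
    printf("done\n");
    return 0;
}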
This is really pretty much the same as this question.
Basically, malloc isn't defined to be thread safe, but implementors are free to add their own machinery to make it thread safe. From your description, it sounds like your particular version is.
To be sure, in the words of Obi-Wan, "Use the Source, Luke." The malloc source will be around, and it's generally pretty straightforward to read.
@Mark, you can get the standard GNU libc source by:
$ git clone git://sourceware.org/git/glibc.git
$ cd glibc
$ git checkout --track -b glibc-2_11-branch origin/release/2.11/master
See also here. Remember that malloc is in manual section 3 -- it's a library function, so it won't be in your kernel sources. You might, however, need to read down into brk, sbrk, getrlimit and setrlimit and the like to find out what the kernel does.
One more link: the GCC project.
Okay, one more (I can stop any time): here's a page from which you can download the sources. Untar the file and you should find it at ./malloc/malloc.c.