My application currently is highly performance critical and is requests 3-5 million objects per frame. Initially, to get the ball rolling, I new'd
everything and got the application to work and test my algorithms. The application is multi-threaded.
Once I was happy with the performance, I started to create a memory manager for my objects. The obvious reason is memory fragmentation and wastage. The application could not continue for more than a few frames before crashing due to memory fragmentation. I have checked for memory leaks and know the application is leak free.
So I started creating a simple memory manager using TBB's concurrent_queue
. The queue stores a maximum set of elements the application is allowed to use. The class requiring new elements pops elements from the queue. The try_pop
method is, according to Intel's documentation, lock-free. This worked quite well as far as memory consumption goes (although there is still memory fragmentation, but not nearly as much as before). The problem I am facing now is that the application's performance has slowed down approximately 4 times according to my own simple profiler (I do not have access to commercial profilers or know of any that will work on a real-time application... any recommendation would be appreciated).
My question is, is there a thread-safe memory pool that is scalable. A must-have
feature of the pool is fast recycling of elements and making them available. If there is none, any tips/tricks performance wise?
EDIT: I thought I would explain the problem a bit more. I could easily initialize n number of arrays where n is the number of threads and start using the objects from the arrays per thread. This will work perfectly for some cases. In my case, I am recycling the elements as well (potentially every frame) and they could be recycled at any point in the array; i.e. it may be from elementArray[0]
or elementArray[10]
or elementArray[1000]
part of the array. Now I will have a fragmented array of elements consisting of elements that are ready to be used and elements that are in-use :(
As said in comments, don't get a thread-safe memory allocator, allocate memory per-thread.
As you implied in your update, you need to manage free/in-use effectively. That is a pretty straightforward problem, given a constant type and no concurrency.
For example (off the top of my head, untested):
template<typename T>
class ThreadStorage
{
std::vector<T> m_objs;
std::vector<size_t> m_avail;
public:
explicit ThreadStorage(size_t count) : m_objs(count, T()) {
m_avail.reserve(count);
for (size_t i = 0; i < count; ++i) m_avail.push_back(i);
}
T* alloc() {
T* retval = &m_objs[0] + m_avail.back();
m_avail.pop_back();
return retval;
}
void free(T* p) {
*p = T(); // Assuming this is enough destruction.
m_avail.push_back(p - &m_objs[0]);
}
};
Then, for each thread, have a ThreadStorage instance, and call alloc() and free() as required.
You can add smart pointers to manage calling free() for you, and you can optimise constructor/destructor calling if that's expensive.
You can also look at boost::pool.
Update:
The new requirement for keeping track of things that have been used so that they can be processed in a second pass seems a bit unclear to me. I think you mean that when the primary processing is finished on an object, you need to not release it, but keep a reference to it for second stage processing. Some objects you will just be released back to the pool and not used for second stage processing.
I assume you want to do this in the same thread.
As a first pass, you could add a method like this to ThreadStorage, and call it when you want to do processing on all unreleased instances of T. No extra book keeping required.
void do_processing(boost::function<void (T* p)> const& f) {
std::sort(m_avail.begin(), m_avail.end());
size_t o = 0;
for (size_t i = 0; i != m_avail.size(); ++i) {
if (o < m_avail[i]) {
do {
f(&m_objs[o]);
} while (++o < m_avail[i]);
++o;
} else of (o == m_avail[i])
++o;
}
for (; o < m_objs.size(); ++o) f(&m_objs[o]);
}
Assumes no other thread is using the ThreadStorage instance, which is reasonable because it is thread-local by design. Again, off the top of my head, untested.
Google's TCMalloc,
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
Performance:
TCMalloc is faster than the glibc 2.3 malloc... ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With