I have a very large array of ~30M objects, approximately 80 bytes apiece (that's ~2.2 GB for those following along), stored on disk. The actual size of each object varies a little because each one has a QMap<quint32, QVariant> child.
Unpacking those objects from raw data is expensive, so I've implemented a multithreaded read operation that pulls a few MB from disk sequentially and then passes each raw data block to a thread to be unpacked in parallel via QtConcurrent. My objects are created (via new) on the heap inside the worker threads and then passed back to the main thread for the next step. Upon completion, these objects are deleted on the main thread.
In a single-threaded environment, this deallocation is relatively fast (~4-5 seconds). However, when multithreaded on 4 threads, this deallocation is incredibly slow (~26-36 seconds). Profiling this with Very Sleepy indicates that the slowdown is in MSVCR100's free, so it's the deallocation itself that is slow.
Searching around SO suggests that allocating and deallocating on different threads is safe. What is the source of the slowdown, and what can I do about it?
Edit: Here is some sample code communicating the idea of what's going on. For the sake of troubleshooting, I have completely removed the disk IO from this example; it simply creates the objects and then deletes them.
#include <QtConcurrent> // QtConcurrent::mappedReduced
#include <QDebug>
#include <QElapsedTimer>
#include <QList>
#include <QMap>
#include <QThreadPool>
#include <QVariant>
#include <functional>   // std::function

class MyObject
{
public:
    MyObject() { /* set defaults... irrelevant here */ }
    ~MyObject() {}
    QMap<quint32, QVariant> map;
    //...other members
};
//...
QList<MyObject*> results;
/* set up the mapped lambda functor (QtConcurrent reqs std::function if returning) */
std::function<QList<MyObject*>(quint64 chunksize)>
importMap = [](quint64 chunksize) -> QList<MyObject*>
{
QList<MyObject*> objs;
for(quint64 i = 0; i < chunksize; ++i) //quint64 to match chunksize
{
MyObject* obj = new MyObject();
obj->map.insert(0, 1); //ran with and without the map insertions
obj->map.insert(1, 2);
objs.append(obj);
}
return objs;
}; //end import map lambda
/* set up the reduce lambda functor */
auto importReduce = [&results](bool& /*noreturn*/, const QList<MyObject*>& chunkimported)
{
results.append(chunkimported);
}; //end import reduce lambda
/* chunk up the data for import */
quint64 totalcount = 31833986;
quint64 chunksize = 500000;
QList<quint64> chunklist;
while(totalcount >= chunksize)
{
totalcount -= chunksize;
chunklist.append(chunksize);
}
if(totalcount > 0)
chunklist.append(totalcount);
/* create the objects concurrently */
QThreadPool::globalInstance()->setMaxThreadCount(1); //4 for multithreaded run
QElapsedTimer tnew; tnew.start();
QFuture<bool> future = QtConcurrent::mappedReduced<bool>(chunklist, importMap, importReduce, QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce);
future.waitForFinished(); //block until all chunks are imported, so the timer measures the work
qDebug("DONE NEW %f", double(tnew.elapsed())/1000.0);
//do stuff with the objects here
/* delete the objects */
QElapsedTimer tdelete; tdelete.start();
qDeleteAll(results);
qDebug("DONE DELETE %f", double(tdelete.elapsed())/1000.0);
Here are the results with and without inserting data into MyObject::map, and with 1 or 4 threads available to QtConcurrent:

No map insertions:
1 thread: tnew = 2.7 seconds; tdelete = 1.1 seconds
4 threads: tnew = 1.8 seconds; tdelete = 2.7 seconds

With map insertions:
1 thread: tnew = 8.6 seconds; tdelete = 4.6 seconds
4 threads: tnew = 4.0 seconds; tdelete = 48.1 seconds

In both scenarios it takes significantly longer to delete the objects when they were created in parallel on 4 threads vs. in serial on 1 thread, and this is further exacerbated by inserting into the QMap in parallel.
This is mostly speculation, but I presume the OS memory manager is a single system-wide facility; after all, it services one pool of virtual memory, so throwing more threads at it will not improve speed, it will just choke it with overhead. Thread safety coupled with concurrent access always comes with a penalty, so the more threads you throw at it, the more penalty you will get.
30M allocations is quite a lot regardless of the size of each allocation, and it also represents a significant memory overhead. I'd recommend you invest the time to implement memory pools: preallocate monolithic chunks of memory and use placement new to construct objects inside those pools (a sketch follows below). This will be a tremendous CPU time saver and a significant memory saver as well, and it will also increase cache friendliness and cache hits by reducing fragmentation.
To put it as a metaphor, putting 4 cooks at a single stove won't make cooking 4 times faster; it will make each cook at least 4 times slower, plus the time they will waste in conflicts over resource usage. That's pretty much what you are seeing in practice.
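To make the pool suggestion concrete, here is a minimal sketch of such an arena, assuming a fixed-capacity bump allocator. PoolArena and its members are hypothetical names, not part of the question's code, and a real implementation would also want to pool the QMap's internal node allocations to capture the full benefit.

// Hypothetical sketch of a memory pool for MyObject: one big allocation
// up front, placement new for construction, one delete[] at teardown.
#include <cstddef>
#include <new> // placement new

class PoolArena
{
public:
    explicit PoolArena(std::size_t capacity)
        : m_storage(new char[capacity * sizeof(MyObject)]),
          m_capacity(capacity),
          m_used(0)
    {}

    ~PoolArena()
    {
        // Destroy every constructed object, then release the block in one free.
        for (std::size_t i = 0; i < m_used; ++i)
            objectAt(i)->~MyObject();
        delete[] m_storage;
    }

    MyObject* allocate()
    {
        if (m_used == m_capacity)
            return nullptr; // arena full; the caller starts a new arena
        return new (m_storage + m_used++ * sizeof(MyObject)) MyObject();
    }

private:
    MyObject* objectAt(std::size_t i)
    {
        return reinterpret_cast<MyObject*>(m_storage + i * sizeof(MyObject));
    }

    char* m_storage;
    std::size_t m_capacity;
    std::size_t m_used;
};

Each worker thread would fill its own arena (for example, one per chunk), so tearing down 30M objects becomes a destructor walk and a single delete[] per arena instead of 30M individual calls into free.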
(updating comment to answer)
It could be that with one thread all the allocations are sequential, so the frees are sequential as well. With multithreaded allocation the blocks are more intermixed, so free has to walk heap metadata scattered across memory and do more work to clean up after each deallocation. A possible mitigation consistent with this explanation is to parallelize the teardown as well, as sketched below.
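As a hedged illustration of that mitigation (my sketch, not code from the post): keep the per-chunk lists from the reduce step instead of flattening them into one QList, then delete each chunk on a worker thread so each run of frees stays close to the run of allocations that produced it. chunkedResults is a hypothetical variable standing in for that per-chunk collection.

// Hypothetical experiment: chunkedResults holds one QList<MyObject*> per
// imported chunk (appended by the reduce step instead of flattened).
QList<QList<MyObject*>> chunkedResults;
// ... fill chunkedResults during the import ...

QtConcurrent::blockingMap(chunkedResults, [](QList<MyObject*>& chunk)
{
    qDeleteAll(chunk); // each worker frees one chunk's worth of objects
    chunk.clear();
});

Whether this actually helps depends on how much of the cost is per-free bookkeeping (which parallelizes) versus contention on the shared heap (which does not), so treat it as an experiment rather than a guaranteed fix.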