
Can multithreading speed up memory allocation?

I'm working with an 8-core processor and am using Boost threads to run a large program. Logically, the program can be split into groups, where each group is run by a thread. Inside each group, some classes invoke the 'new' operator a total of 10000 times. Rational Quantify shows that the 'new' memory allocation takes up the most processing time when the program runs, and is slowing down the entire program.

One way I could speed up the system would be to use threads inside each 'group', so that the 10000 memory allocations can happen in parallel.

I'm unclear on how memory allocation would be managed here. Will the OS scheduler really be able to allocate memory in parallel?

Nav asked Feb 01 '11 05:02


1 Answer

Standard CRT

With older versions of Visual Studio the default CRT allocator was blocking (allocations from different threads were serialized on a lock), but this is no longer true, at least for Visual Studio 2010 and newer, where the CRT calls the corresponding OS heap functions directly. The Windows heap manager was blocking up to Windows XP. In XP the optional Low Fragmentation Heap (LFH) is not blocking while the default heap is, and newer OSes (Vista/Win7) use the LFH by default. The performance of recent (Windows 7) allocators is very good, comparable to the scalable replacements listed below (you might still prefer those when targeting older platforms or when you need some other feature they provide). There exist several "scalable allocators", with different licenses and different drawbacks. I think on Linux the default runtime library already uses a scalable allocator (some variant of PTMalloc).

Scalable replacements

I know about:

  • HOARD (GNU + commercial licenses)
  • MicroQuill SmartHeap for SMP (commercial license)
  • Google Perf Tools TCMalloc (BSD license)
  • NedMalloc (BSD license)
  • JemAlloc (BSD license)
  • PTMalloc (GNU, no Windows port yet?)
  • Intel Threading Building Blocks (GNU, commercial)

You might want to check "Scalable memory allocator experiences" for my experiences with trying to use some of them in a Windows project.

In practice most of them work by keeping a per-thread cache and per-thread pre-allocated regions for allocations, so small allocations are mostly served within the context of the calling thread alone, and OS services are called only infrequently.
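To make the per-thread cache idea concrete, here is a minimal C++11 sketch. It is not how any of the allocators listed above is actually implemented. The names `ThreadLocalPool` and `run_workers`, the 64-byte block size and the batch of 256 blocks are all assumptions made for illustration. Each thread keeps its own free list, so the common path takes no lock, and the shared allocator (`std::malloc` here) is called only when the list runs dry.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <thread>
#include <vector>

// Hypothetical per-thread cache of fixed-size blocks (illustration only).
// Simplification: a block must be freed on the thread that allocated it,
// and before that thread exits, because each thread frees its own slabs.
class ThreadLocalPool {
public:
    static constexpr std::size_t kBlockSize = 64;  // assumed size class
    static constexpr std::size_t kBatch = 256;     // blocks fetched per refill

    static void* allocate() {
        Cache& c = cache();
        if (c.free_list == nullptr) refill(c);  // slow path: shared allocator
        Node* n = c.free_list;                  // fast path: no lock needed
        c.free_list = n->next;
        return n;
    }

    static void deallocate(void* p) {
        assert(p != nullptr);
        Cache& c = cache();
        Node* n = static_cast<Node*>(p);
        n->next = c.free_list;                  // push back on this thread's list
        c.free_list = n;
    }

    // Number of times the calling thread had to go to the shared allocator.
    static std::size_t refills() { return cache().refills; }

private:
    struct Node { Node* next; };

    struct Cache {
        Node* free_list = nullptr;
        std::size_t refills = 0;
        std::vector<void*> slabs;
        ~Cache() { for (void* s : slabs) std::free(s); }  // runs at thread exit
    };

    static Cache& cache() {
        thread_local Cache c;  // one independent cache per thread
        return c;
    }

    static void refill(Cache& c) {
        char* slab = static_cast<char*>(std::malloc(kBlockSize * kBatch));
        if (slab == nullptr) throw std::bad_alloc();
        c.slabs.push_back(slab);
        ++c.refills;
        // Carve the slab into blocks and thread them onto the free list.
        for (std::size_t i = 0; i < kBatch; ++i) {
            Node* n = reinterpret_cast<Node*>(slab + i * kBlockSize);
            n->next = c.free_list;
            c.free_list = n;
        }
    }
};

// Spawns `threads` workers; each allocates and then frees `allocs_per_thread`
// blocks. Returns the total number of slow-path refills across all workers.
std::size_t run_workers(unsigned threads, std::size_t allocs_per_thread) {
    std::atomic<std::size_t> total_refills{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&total_refills, allocs_per_thread] {
            std::vector<void*> blocks;
            blocks.reserve(allocs_per_thread);
            for (std::size_t i = 0; i < allocs_per_thread; ++i)
                blocks.push_back(ThreadLocalPool::allocate());
            for (void* p : blocks) ThreadLocalPool::deallocate(p);
            total_refills += ThreadLocalPool::refills();
        });
    }
    for (std::thread& w : workers) w.join();
    return total_refills.load();
}
```

With these assumed sizes, calling `run_workers(8, 10000)` mirrors the numbers in the question: each of the 8 threads performs 10000 allocations but reaches the shared allocator only 40 times. Real scalable allocators additionally handle many size classes, frees from other threads, and returning memory to the OS, which this sketch deliberately leaves out.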

Suma answered Oct 12 '22 10:10