Does multithreading emphasize memory fragmentation?

Description

When allocating and deallocating randomly sized memory chunks with 4 or more threads using openmp's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test-program's runtime. Thus it increases its consumed memory from 1050 MB to 1500 MB or more without actually making use of the extra memory.

As valgrind shows no issues, I must assume that what appears to be a memory leak actually is an emphasized effect of memory fragmentation.

Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of allocated chunks is reduced to 256kb (from 1mb), the effect gets weaker.

Can heavy concurrency emphasize fragmentation that much ? Or is this more likely to be a bug in the heap ?

Test Program Description

The demo program is build to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first will be deallocated until the memory consumption falls below the limit. Once 5000 allocations where performed, all memory is released and the loop ends. All this work is done for each thread generated by openmp.

This memory allocation scheme allows us to expect a memory consumption of ~260 MB per thread (including some bookkeeping data).

Demo Program

As this is really something you might want to test, you can download the sample program with a simple makefile from dropbox.

When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.

For completeness, the actual code follows:

#include <stdlib.h> #include <stdio.h> #include <iostream> #include <vector> #include <deque>  #include <omp.h> #include <math.h>  typedef unsigned long long uint64_t;  void runParallelAllocTest() {     // constants     const int  NUM_ALLOCATIONS = 5000; // alloc's per thread     const int  NUM_THREADS = 4;       // how many threads?     const int  NUM_ITERS = NUM_THREADS;// how many overall repetions      const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)     const bool DEBUG_ALLOCS = false;  // debug output      // pre store allocation sizes     const int  NUM_PRE_ALLOCS = 20000;     const uint64_t MEM_LIMIT = (1024 * 1024) * 256;   // x MB per process     const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;      srand(1);     std::vector<size_t> allocations;     allocations.resize(NUM_PRE_ALLOCS);     for (int i = 0; i < NUM_PRE_ALLOCS; i++) {         allocations[i] = rand() % MAX_CHUNK_SIZE;   // use up to x MB chunks     }       #pragma omp parallel num_threads(NUM_THREADS)     #pragma omp for     for (int i = 0; i < NUM_ITERS; ++i) {         uint64_t long totalAllocBytes = 0;         uint64_t currAllocBytes = 0;          std::deque< std::pair<char*, uint64_t> > pointers;         const int myId = omp_get_thread_num();          for (int j = 0; j < NUM_ALLOCATIONS; ++j) {             // new allocation             const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS ];              char* pnt = NULL;             if (USE_NEW) {                 pnt = new char[allocSize];             } else {                 pnt = (char*) malloc(allocSize);             }             pointers.push_back(std::make_pair(pnt, allocSize));              totalAllocBytes += allocSize;             currAllocBytes  += allocSize;              // fill with values to add "delay"             for (int fill = 0; fill < (int) allocSize; ++fill) {                 pnt[fill] = (char)(j % 255);             }               if (DEBUG_ALLOCS) {                 std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";             }              // free all or just a bit             if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {                 int frees = 0;                  // keep this much allocated                 // last check, free all                 uint64_t memLimit = MEM_LIMIT;                 if (j == NUM_ALLOCATIONS - 1) {                     std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;                     memLimit = 0;                 }                 //MEM_LIMIT = 0; // DEBUG                  while (pointers.size() > 0 && (currAllocBytes > memLimit)) {                     // free one of the first entries to allow previously obtained resources to 'live' longer                     currAllocBytes -= pointers.front().second;                     char* pnt       = pointers.front().first;                      // free memory                     if (USE_NEW) {                         delete[] pnt;                     } else {                         free(pnt);                     }                      // update array                     pointers.pop_front();                      if (DEBUG_ALLOCS) {                         std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n";                     }                     frees++;                 }                 if (DEBUG_ALLOCS) {                     std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n";                 }             }         } // for each allocation          if (currAllocBytes != 0) {             std::cerr << "Not all free'd!\n";         }          std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";     } // for each iteration      exit(1); }  int main(int argc, char** argv) {     runParallelAllocTest();      return 0; }

The Test-System

From what I see so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.

Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz Ubuntu 10.04 LTS 64 bit gcc 4.3, 4.4, 4.6 3988.62 Bogomips

Testing

Once you have executed the makefile, you should get a file named ompmemtest. To query the memory usage over time, I used the following commands:

./ompmemtest & top -b | grep ompmemtest

Which yields the quite impressive fragmentation or leaking behaviour. The expected memory consumption with 4 threads is 1090 MB, which became 1500 MB over time:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 11626 byron     20   0  204m  99m 1000 R   27  2.5   0:00.81 ompmemtest                                                                               11626 byron     20   0  992m 832m 1004 R  195 21.0   0:06.69 ompmemtest                                                                               11626 byron     20   0 1118m 1.0g 1004 R  189 26.1   0:12.40 ompmemtest                                                                               11626 byron     20   0 1218m 1.0g 1004 R  190 27.1   0:18.13 ompmemtest                                                                               11626 byron     20   0 1282m 1.1g 1004 R  195 29.6   0:24.06 ompmemtest                                                                               11626 byron     20   0 1471m 1.3g 1004 R  195 33.5   0:29.96 ompmemtest                                                                               11626 byron     20   0 1469m 1.3g 1004 R  194 33.5   0:35.85 ompmemtest                                                                               11626 byron     20   0 1469m 1.3g 1004 R  195 33.6   0:41.75 ompmemtest                                                                               11626 byron     20   0 1636m 1.5g 1004 R  194 37.8   0:47.62 ompmemtest                                                                               11626 byron     20   0 1660m 1.5g 1004 R  195 38.0   0:53.54 ompmemtest                                                                               11626 byron     20   0 1669m 1.5g 1004 R  195 38.2   0:59.45 ompmemtest                                                                               11626 byron     20   0 1664m 1.5g 1004 R  194 38.1   1:05.32 ompmemtest                                                                               11626 byron     20   0 1724m 1.5g 1004 R  195 40.0   1:11.21 ompmemtest                                                                               11626 byron     20   0 1724m 1.6g 1140 S  193 40.1   1:17.07 ompmemtest

Please Note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6(trunk).

306

asked May 03 '11 21:05

Byron

1 Answers

Ok, picked up the bait.

This is on a system with

Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz 4x5666.59 bogomips  Linux meerkat 2.6.35-28-generic-pae #50-Ubuntu SMP Fri Mar 18 20:43:15 UTC 2011 i686 GNU/Linux  gcc version 4.4.5               total       used       free     shared    buffers     cached Mem:       8127172    4220560    3906612          0     374328    2748796 -/+ buffers/cache:    1097436    7029736 Swap:            0          0          0

Naive run

I just ran it

time ./ompmemtest  Id 0 about to release all memory: 258.144 MB Id 0 done, total alloc'ed -1572.7MB  Id 3 about to release all memory: 257.854 MB Id 3 done, total alloc'ed -1569.6MB  Id 1 about to release all memory: 257.339 MB Id 2 about to release all memory: 257.043 MB Id 1 done, total alloc'ed -1570.42MB  Id 2 done, total alloc'ed -1569.96MB   real    0m13.429s user    0m44.619s sys 0m6.000s

Nothing spectacular. Here is the simultaneous output of vmstat -S M 1

Vmstat raw data

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----  0  0      0   3892    364   2669    0    0    24     0  701 1487  2  1 97  0  4  0      0   3421    364   2669    0    0     0     0 1317 1953 53  7 40  0  4  0      0   2858    364   2669    0    0     0     0 2715 5030 79 16  5  0  4  0      0   2861    364   2669    0    0     0     0 6164 12637 76 15  9  0  4  0      0   2853    364   2669    0    0     0     0 4845 8617 77 13 10  0  4  0      0   2848    364   2669    0    0     0     0 3782 7084 79 13  8  0  5  0      0   2842    364   2669    0    0     0     0 3723 6120 81 12  7  0  4  0      0   2835    364   2669    0    0     0     0 3477 4943 84  9  7  0  4  0      0   2834    364   2669    0    0     0     0 3273 4950 81 10  9  0  5  0      0   2828    364   2669    0    0     0     0 3226 4812 84 11  6  0  4  0      0   2823    364   2669    0    0     0     0 3250 4889 83 10  7  0  4  0      0   2826    364   2669    0    0     0     0 3023 4353 85 10  6  0  4  0      0   2817    364   2669    0    0     0     0 3176 4284 83 10  7  0  4  0      0   2823    364   2669    0    0     0     0 3008 4063 84 10  6  0  0  0      0   3893    364   2669    0    0     0     0 4023 4228 64 10 26  0

Does that information mean anything to you?

Google Thread Caching Malloc

Now for real fun, add a little spice

time LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtest  Id 1 about to release all memory: 257.339 MB Id 1 done, total alloc'ed -1570.42MB  Id 3 about to release all memory: 257.854 MB Id 3 done, total alloc'ed -1569.6MB  Id 2 about to release all memory: 257.043 MB Id 2 done, total alloc'ed -1569.96MB  Id 0 about to release all memory: 258.144 MB Id 0 done, total alloc'ed -1572.7MB   real    0m11.663s user    0m44.255s sys 0m1.028s

Looks faster, not?

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----  4  0      0   3562    364   2684    0    0     0     0 1041 1676 28  7 64  0  4  2      0   2806    364   2684    0    0     0   172 1641 1843 84 14  1  0  4  0      0   2758    364   2685    0    0     0     0 1520 1009 98  2  1  0  4  0      0   2747    364   2685    0    0     0     0 1504  859 98  2  0  0  5  0      0   2745    364   2685    0    0     0     0 1575 1073 98  2  0  0  5  0      0   2739    364   2685    0    0     0     0 1415  743 99  1  0  0  4  0      0   2738    364   2685    0    0     0     0 1526  981 99  2  0  0  4  0      0   2731    364   2685    0    0     0   684 1536  927 98  2  0  0  4  0      0   2730    364   2685    0    0     0     0 1584 1010 99  1  0  0  5  0      0   2730    364   2685    0    0     0     0 1461  917 99  2  0  0  4  0      0   2729    364   2685    0    0     0     0 1561 1036 99  1  0  0  4  0      0   2729    364   2685    0    0     0     0 1406  756 100  1  0  0  0  0      0   3819    364   2685    0    0     0     4 1159 1476 26  3 71  0

In case you wanted to compare vmstat outputs

`Valgrind --tool massif`

This is the head of output from ms_print after valgrind --tool=massif ./ompmemtest (default malloc):

-------------------------------------------------------------------------------- Command:            ./ompmemtest Massif arguments:   (none) ms_print arguments: massif.out.beforetcmalloc --------------------------------------------------------------------------------       GB 1.009^                                                                     :        |       ##::::@@:::::::@@::::::@@::::@@::@::::@::::@:::::::::@::::::@:::       |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::       |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::       |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::       |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::       |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |   ::::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::      | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::    0 +----------------------------------------------------------------------->Gi      0                                                                   264.0  Number of snapshots: 63  Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]

Google HEAPPROFILE

Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftools

gcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtest  time HEAPPROFILE=/tmp/heapprofile ./ompmemtest Starting tracking the heap Dumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use) Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use) Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use) Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use) Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use) Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use) Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use) Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use) Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use) Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use) Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use) Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use) Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use) Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use) Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use) Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use) Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use) Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use) Id 0 about to release all memory: 258.144 MB Id 0 done, total alloc'ed -1572.7MB  Id 2 about to release all memory: 257.043 MB Id 2 done, total alloc'ed -1569.96MB  Id 3 about to release all memory: 257.854 MB Id 3 done, total alloc'ed -1569.6MB  Id 1 about to release all memory: 257.339 MB Id 1 done, total alloc'ed -1570.42MB  Dumping heap profile to /tmp/heapprofile.0019.heap (Exiting)  real    0m11.981s user    0m44.455s sys 0m1.124s

Contact me for full logs/details

Update

To the comments: I updated the program

--- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200 +++ q/openMpMemtest_Linux.cpp   2011-05-04 13:42:47.371726000 +0200 @@ -13,8 +13,8 @@  void runParallelAllocTest()  {     // constants -   const int  NUM_ALLOCATIONS = 5000; // alloc's per thread -   const int  NUM_THREADS = 4;       // how many threads? +   const int  NUM_ALLOCATIONS = 55000; // alloc's per thread +   const int  NUM_THREADS = 8;        // how many threads?     const int  NUM_ITERS = NUM_THREADS;// how many overall repetions      const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)

It ran for over 5m3s. Close to the end, a screenshot of htop teaches that indeed, the reserved set is slightly higher, going towards 2.3g:

  1  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Tasks: 125 total, 2 running   2  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Load average: 8.09 5.24 2.37    3  [||||||||||||||||||||||||||||||||||||||||||||||||||97.4%]     Uptime: 01:54:22   4  [||||||||||||||||||||||||||||||||||||||||||||||||||96.1%]   Mem[|||||||||||||||||||||||||||||||             3055/7936MB]   Swp[                                                  0/0MB]    PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command  4330 sehe        8  20   0 2635M 2286M   908 R 368. 28.8 15:35.01 ./ompmemtest

Comparing results with a tcmalloc run: 4m12s, ~~similar top stats~~ has minor differences; the big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process?). The RES set is quite similar, if you ask me. The more important thing to note is parallellism is increased; all cores are now maxed out. This is obviously due to reduced need to lock for heap operations when using tcmalloc:

If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the applications.

  1  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 172 total, 2 running   2  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Load average: 7.39 2.92 1.11    3  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Uptime: 11:12:25   4  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]   Mem[||||||||||||||||||||||||||||||||||||||||||||              3278/7936MB]   Swp[                                                                0/0MB]    PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command 14391 sehe        8  20   0 2251M 2179M  1148 R 379. 27.5  8:08.92 ./ompmemtest

answered Oct 05 '22 10:10

sehe

Related questions
                            
                                Is a C++ destructor guaranteed not to be called until the end of the block?
                            
                                Is this behavior of vector::resize(size_type n) under C++11 and Boost.Container correct?
                            
                                Getting the current time (in milliseconds) from the system clock in Windows?
                            
                                C++ std::map holding ANY type of value
                            
                                How to sleep a C++ Boost Thread
                            
                                What is the address of a function in a C++ program?
                            
                                STL way to add a constant value to a std::vector
                            
                                Strange ambiguous call to overloaded function error
                            
                                XOR Operation Intuition
                            
                                Missing default argument - compiler error
                            
                                HighGUI is missing from OpenCV 3.0.0 JAR
                            
                                Error: Could not resolve SDK path for 'macosx10.8'
                            
                                create my own programming language [duplicate]
                            
                                malloc & placement new vs. new
                            
                                How do I print bytes as hexadecimal?
                            
                                Default template parameter & lambda in unevaluated context: bug or feature?
                            
                                How do I give streams better exception messages?
                            
                                C++11 atomic memory ordering - is this a correct usage of relaxed (release-consume) ordering?
                            
                                Fold expressions with arbitrary callable?
                            
                                Is this code behavior defined?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With