When allocating and deallocating randomly sized memory chunks with 4 or more threads using openmp's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test-program's runtime. Thus it increases its consumed memory from 1050 MB to 1500 MB or more without actually making use of the extra memory.
As valgrind shows no issues, I must assume that what appears to be a memory leak actually is an emphasized effect of memory fragmentation.
Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of allocated chunks is reduced to 256kb (from 1mb), the effect gets weaker.
Can heavy concurrency emphasize fragmentation that much ? Or is this more likely to be a bug in the heap ?
The demo program is build to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first will be deallocated until the memory consumption falls below the limit. Once 5000 allocations where performed, all memory is released and the loop ends. All this work is done for each thread generated by openmp.
This memory allocation scheme allows us to expect a memory consumption of ~260 MB per thread (including some bookkeeping data).
As this is really something you might want to test, you can download the sample program with a simple makefile from dropbox.
When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.
For completeness, the actual code follows:
#include <stdlib.h> #include <stdio.h> #include <iostream> #include <vector> #include <deque> #include <omp.h> #include <math.h> typedef unsigned long long uint64_t; void runParallelAllocTest() { // constants const int NUM_ALLOCATIONS = 5000; // alloc's per thread const int NUM_THREADS = 4; // how many threads? const int NUM_ITERS = NUM_THREADS;// how many overall repetions const bool USE_NEW = true; // use new or malloc? , seems to make no difference (as it should) const bool DEBUG_ALLOCS = false; // debug output // pre store allocation sizes const int NUM_PRE_ALLOCS = 20000; const uint64_t MEM_LIMIT = (1024 * 1024) * 256; // x MB per process const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1; srand(1); std::vector<size_t> allocations; allocations.resize(NUM_PRE_ALLOCS); for (int i = 0; i < NUM_PRE_ALLOCS; i++) { allocations[i] = rand() % MAX_CHUNK_SIZE; // use up to x MB chunks } #pragma omp parallel num_threads(NUM_THREADS) #pragma omp for for (int i = 0; i < NUM_ITERS; ++i) { uint64_t long totalAllocBytes = 0; uint64_t currAllocBytes = 0; std::deque< std::pair<char*, uint64_t> > pointers; const int myId = omp_get_thread_num(); for (int j = 0; j < NUM_ALLOCATIONS; ++j) { // new allocation const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS ]; char* pnt = NULL; if (USE_NEW) { pnt = new char[allocSize]; } else { pnt = (char*) malloc(allocSize); } pointers.push_back(std::make_pair(pnt, allocSize)); totalAllocBytes += allocSize; currAllocBytes += allocSize; // fill with values to add "delay" for (int fill = 0; fill < (int) allocSize; ++fill) { pnt[fill] = (char)(j % 255); } if (DEBUG_ALLOCS) { std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n"; } // free all or just a bit if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) { int frees = 0; // keep this much allocated // last check, free all uint64_t memLimit = MEM_LIMIT; if (j == NUM_ALLOCATIONS - 1) { std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl; memLimit = 0; } //MEM_LIMIT = 0; // DEBUG while (pointers.size() > 0 && (currAllocBytes > memLimit)) { // free one of the first entries to allow previously obtained resources to 'live' longer currAllocBytes -= pointers.front().second; char* pnt = pointers.front().first; // free memory if (USE_NEW) { delete[] pnt; } else { free(pnt); } // update array pointers.pop_front(); if (DEBUG_ALLOCS) { std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n"; } frees++; } if (DEBUG_ALLOCS) { std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n"; } } } // for each allocation if (currAllocBytes != 0) { std::cerr << "Not all free'd!\n"; } std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n"; } // for each iteration exit(1); } int main(int argc, char** argv) { runParallelAllocTest(); return 0; }
From what I see so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.
Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz Ubuntu 10.04 LTS 64 bit gcc 4.3, 4.4, 4.6 3988.62 Bogomips
Once you have executed the makefile, you should get a file named ompmemtest
. To query the memory usage over time, I used the following commands:
./ompmemtest & top -b | grep ompmemtest
Which yields the quite impressive fragmentation or leaking behaviour. The expected memory consumption with 4 threads is 1090 MB, which became 1500 MB over time:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 11626 byron 20 0 204m 99m 1000 R 27 2.5 0:00.81 ompmemtest 11626 byron 20 0 992m 832m 1004 R 195 21.0 0:06.69 ompmemtest 11626 byron 20 0 1118m 1.0g 1004 R 189 26.1 0:12.40 ompmemtest 11626 byron 20 0 1218m 1.0g 1004 R 190 27.1 0:18.13 ompmemtest 11626 byron 20 0 1282m 1.1g 1004 R 195 29.6 0:24.06 ompmemtest 11626 byron 20 0 1471m 1.3g 1004 R 195 33.5 0:29.96 ompmemtest 11626 byron 20 0 1469m 1.3g 1004 R 194 33.5 0:35.85 ompmemtest 11626 byron 20 0 1469m 1.3g 1004 R 195 33.6 0:41.75 ompmemtest 11626 byron 20 0 1636m 1.5g 1004 R 194 37.8 0:47.62 ompmemtest 11626 byron 20 0 1660m 1.5g 1004 R 195 38.0 0:53.54 ompmemtest 11626 byron 20 0 1669m 1.5g 1004 R 195 38.2 0:59.45 ompmemtest 11626 byron 20 0 1664m 1.5g 1004 R 194 38.1 1:05.32 ompmemtest 11626 byron 20 0 1724m 1.5g 1004 R 195 40.0 1:11.21 ompmemtest 11626 byron 20 0 1724m 1.6g 1140 S 193 40.1 1:17.07 ompmemtest
Please Note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6(trunk).
Memory fragmentation is caused by a combination of the allocation strategy used by the allocator you are using, the sizes and alignments of the internal structures, combined with the memory allocation behaviour of your software applcation.
Reducing the number of sizes between these extremes also helps. Employing sizes that increase logarithmically saves a lot of fragmentation. For example, each size could be 20% larger than the previous size. “One size fits all” might not be true for memory allocators in embedded system.
Multi-threading gets around requiring additional memory because it relies on a shared memory between threads. Shared memory removes the additional memory overhead but still incurs the penalty of increased context switching.
This problem occurs when you allocate RAM to processes continuously. It is done in paging and segmentation, where memory is allocated to processes non-contiguously. As a result, if you remove this condition, external fragmentation may be decreased. Compaction is another method for removing external fragmentation.
Ok, picked up the bait.
This is on a system with
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz 4x5666.59 bogomips Linux meerkat 2.6.35-28-generic-pae #50-Ubuntu SMP Fri Mar 18 20:43:15 UTC 2011 i686 GNU/Linux gcc version 4.4.5 total used free shared buffers cached Mem: 8127172 4220560 3906612 0 374328 2748796 -/+ buffers/cache: 1097436 7029736 Swap: 0 0 0
I just ran it
time ./ompmemtest Id 0 about to release all memory: 258.144 MB Id 0 done, total alloc'ed -1572.7MB Id 3 about to release all memory: 257.854 MB Id 3 done, total alloc'ed -1569.6MB Id 1 about to release all memory: 257.339 MB Id 2 about to release all memory: 257.043 MB Id 1 done, total alloc'ed -1570.42MB Id 2 done, total alloc'ed -1569.96MB real 0m13.429s user 0m44.619s sys 0m6.000s
Nothing spectacular. Here is the simultaneous output of vmstat -S M 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- 0 0 0 3892 364 2669 0 0 24 0 701 1487 2 1 97 0 4 0 0 3421 364 2669 0 0 0 0 1317 1953 53 7 40 0 4 0 0 2858 364 2669 0 0 0 0 2715 5030 79 16 5 0 4 0 0 2861 364 2669 0 0 0 0 6164 12637 76 15 9 0 4 0 0 2853 364 2669 0 0 0 0 4845 8617 77 13 10 0 4 0 0 2848 364 2669 0 0 0 0 3782 7084 79 13 8 0 5 0 0 2842 364 2669 0 0 0 0 3723 6120 81 12 7 0 4 0 0 2835 364 2669 0 0 0 0 3477 4943 84 9 7 0 4 0 0 2834 364 2669 0 0 0 0 3273 4950 81 10 9 0 5 0 0 2828 364 2669 0 0 0 0 3226 4812 84 11 6 0 4 0 0 2823 364 2669 0 0 0 0 3250 4889 83 10 7 0 4 0 0 2826 364 2669 0 0 0 0 3023 4353 85 10 6 0 4 0 0 2817 364 2669 0 0 0 0 3176 4284 83 10 7 0 4 0 0 2823 364 2669 0 0 0 0 3008 4063 84 10 6 0 0 0 0 3893 364 2669 0 0 0 0 4023 4228 64 10 26 0
Does that information mean anything to you?
Now for real fun, add a little spice
time LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtest Id 1 about to release all memory: 257.339 MB Id 1 done, total alloc'ed -1570.42MB Id 3 about to release all memory: 257.854 MB Id 3 done, total alloc'ed -1569.6MB Id 2 about to release all memory: 257.043 MB Id 2 done, total alloc'ed -1569.96MB Id 0 about to release all memory: 258.144 MB Id 0 done, total alloc'ed -1572.7MB real 0m11.663s user 0m44.255s sys 0m1.028s
Looks faster, not?
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- 4 0 0 3562 364 2684 0 0 0 0 1041 1676 28 7 64 0 4 2 0 2806 364 2684 0 0 0 172 1641 1843 84 14 1 0 4 0 0 2758 364 2685 0 0 0 0 1520 1009 98 2 1 0 4 0 0 2747 364 2685 0 0 0 0 1504 859 98 2 0 0 5 0 0 2745 364 2685 0 0 0 0 1575 1073 98 2 0 0 5 0 0 2739 364 2685 0 0 0 0 1415 743 99 1 0 0 4 0 0 2738 364 2685 0 0 0 0 1526 981 99 2 0 0 4 0 0 2731 364 2685 0 0 0 684 1536 927 98 2 0 0 4 0 0 2730 364 2685 0 0 0 0 1584 1010 99 1 0 0 5 0 0 2730 364 2685 0 0 0 0 1461 917 99 2 0 0 4 0 0 2729 364 2685 0 0 0 0 1561 1036 99 1 0 0 4 0 0 2729 364 2685 0 0 0 0 1406 756 100 1 0 0 0 0 0 3819 364 2685 0 0 0 4 1159 1476 26 3 71 0
In case you wanted to compare vmstat outputs
Valgrind --tool massif
This is the head of output from ms_print
after valgrind --tool=massif ./ompmemtest
(default malloc):
-------------------------------------------------------------------------------- Command: ./ompmemtest Massif arguments: (none) ms_print arguments: massif.out.beforetcmalloc -------------------------------------------------------------------------------- GB 1.009^ : | ##::::@@:::::::@@::::::@@::::@@::@::::@::::@:::::::::@::::::@::: | # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: | # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: | :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: | :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: | :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::: 0 +----------------------------------------------------------------------->Gi 0 264.0 Number of snapshots: 63 Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]
Unfortunately, vanilla valgrind
doesn't work with tcmalloc
, so I switched horses midrace to heap profiling with google-perftools
gcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtest time HEAPPROFILE=/tmp/heapprofile ./ompmemtest Starting tracking the heap Dumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use) Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use) Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use) Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use) Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use) Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use) Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use) Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use) Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use) Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use) Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use) Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use) Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use) Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use) Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use) Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use) Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use) Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use) Id 0 about to release all memory: 258.144 MB Id 0 done, total alloc'ed -1572.7MB Id 2 about to release all memory: 257.043 MB Id 2 done, total alloc'ed -1569.96MB Id 3 about to release all memory: 257.854 MB Id 3 done, total alloc'ed -1569.6MB Id 1 about to release all memory: 257.339 MB Id 1 done, total alloc'ed -1570.42MB Dumping heap profile to /tmp/heapprofile.0019.heap (Exiting) real 0m11.981s user 0m44.455s sys 0m1.124s
To the comments: I updated the program
--- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200 +++ q/openMpMemtest_Linux.cpp 2011-05-04 13:42:47.371726000 +0200 @@ -13,8 +13,8 @@ void runParallelAllocTest() { // constants - const int NUM_ALLOCATIONS = 5000; // alloc's per thread - const int NUM_THREADS = 4; // how many threads? + const int NUM_ALLOCATIONS = 55000; // alloc's per thread + const int NUM_THREADS = 8; // how many threads? const int NUM_ITERS = NUM_THREADS;// how many overall repetions const bool USE_NEW = true; // use new or malloc? , seems to make no difference (as it should)
It ran for over 5m3s. Close to the end, a screenshot of htop teaches that indeed, the reserved set is slightly higher, going towards 2.3g:
1 [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%] Tasks: 125 total, 2 running 2 [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%] Load average: 8.09 5.24 2.37 3 [||||||||||||||||||||||||||||||||||||||||||||||||||97.4%] Uptime: 01:54:22 4 [||||||||||||||||||||||||||||||||||||||||||||||||||96.1%] Mem[||||||||||||||||||||||||||||||| 3055/7936MB] Swp[ 0/0MB] PID USER NLWP PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command 4330 sehe 8 20 0 2635M 2286M 908 R 368. 28.8 15:35.01 ./ompmemtest
Comparing results with a tcmalloc run: 4m12s, similar top stats has minor differences; the big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process?). The RES set is quite similar, if you ask me. The more important thing to note is parallellism is increased; all cores are now maxed out. This is obviously due to reduced need to lock for heap operations when using tcmalloc:
If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the applications.
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Tasks: 172 total, 2 running 2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Load average: 7.39 2.92 1.11 3 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Uptime: 11:12:25 4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Mem[|||||||||||||||||||||||||||||||||||||||||||| 3278/7936MB] Swp[ 0/0MB] PID USER NLWP PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command 14391 sehe 8 20 0 2251M 2179M 1148 R 379. 27.5 8:08.92 ./ompmemtest
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With