Why does repeated memory allocation in windows slow down?

Question

I want to understand why the following code behaves differently on my linux and windows 7 machines: On linux it takes ~120ms per iteration. On windows 7 already the first iteration takes 0.4 seconds and subsequent iterations take much longer. Iteration 8 already takes about 11 seconds, iteration 22 takes roughly 1 minute.

I observed this behaviour on different hardware. It seems to be related to windows.

#include <iostream>
#include <time.h>
#include <chrono>

void iteration() {
  int n = 25000;
  // Allocate memory
  long** blocks = new long*[n];
  for( int i = 0; i<n; ++i )
  {
    blocks[i] = new long[100008];
  }
  // Free all allocated memory
  for( int i = 0; i<n; ++i )
  {
    delete[] blocks[i];
  }
  delete[] blocks;
}

int main(int argc, char **argv) {
  int nbIter = 30;
  for( int i = 0; i < nbIter; ++i )
  {
    auto start = std::chrono::system_clock::now();
    iteration();
    auto end = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Iteration #" << i << ": time=" << elapsed.count() << "ms" << std::endl;
  }
  return 0;
}

Can anyone tell me what happens here, and how to get the code to run stable on windows?

edit: I did a release build in VS2013 on windows, I executed the program from outside VS. here are some more exact running times (in seconds):

Iteration #0: time=0.381000
Iteration #1: time=0.391000
Iteration #2: time=0.451000
Iteration #3: time=1.507000
Iteration #4: time=1.711000
Iteration #5: time=2.938000
Iteration #6: time=4.770000
Iteration #7: time=7.840000
Iteration #8: time=10.563000
Iteration #9: time=14.606000
Iteration #10: time=20.732000
Iteration #11: time=24.948000
Iteration #12: time=30.255000
Iteration #13: time=34.608000
Iteration #14: time=38.114000
Iteration #15: time=43.217000
Iteration #16: time=39.926000
Iteration #17: time=43.506000
Iteration #18: time=43.660000
Iteration #19: time=45.291000
Iteration #20: time=50.003000

txtechhelp · Accepted Answer

Digging around in the references about the heap on Windows (as well as some infosec articles on it), there are some tidbits about common heap slow downs which states that a few are

Slowdown as a result of allocation operations.
Slowdown as a result of free operations.
Slowdown as a result of frequent allocs and reallocs.

Which does help to explain why there is a slow down (i.e. frequent allocs and reallocs), but it doesn't really explain why there is a slow down.

The first thing that should be noted is that sizeof(long) != sizeof(long), that is, in the tests I did with 64-bit builds using g++ and Visual Studio 12, sizeof(long) on Windows was 4 and on Linux it was 8. This is an important note when allocating/deallocating memory. If you change your code from the long type to a type where sizeof(T) == 8 (like long long) then the issue goes away and the times are consistent across iterations. Example:

void iteration() {
    int n = 25000;
    // Allocate memory
    long long** blocks = new long long*[n];
    for (int i = 0; i < n; ++i) {
        blocks[i] = new long long[100008];
    }
    // Free all allocated memory
    for (int i = 0; i < n; ++i) {
        delete[] blocks[i];
    }
    delete[] blocks;
}
// sizeof(long long) == 8 on my Linux/Unix and Windows 64-bit machines

It should also be noted that the timing issue goes away only for exactly this code.

If you keep the type of long long but adjusted 100008 to say 16666, the issue comes up again; further, if you changed it to 16668 and ran the long long iteration immediately followed by the long version, the timing would go up for the long long function, then down for the long, example:

template < typename T >
void iteration() {
    int n = 25000;
    // Allocate memory
    T** blocks = new T*[n];
    for (int i = 0; i < n; ++i) {
        blocks[i] = new T[16668];
    }
    // Free all allocated memory
    for (int i = 0; i < n; ++i) {
        delete[] blocks[i];
    }
    delete[] blocks;
}

for (int i = 0; i < nbItrs; ++i) {
    iteration<long long>(); // time goes UP
}
for (int i = 0; i < nbItrs; ++i) {
    iteration<long>(); // time goes DOWN
}

Additionally the posted code produces similar results using malloc/free, LocalAlloc/LocalFree and/or HeapAlloc/HeapFree since new/malloc (on Windows) calls HeapAlloc. The reason has to do with how Windows manages it's heap memory and the locations of the freed memory. When pages have to be deleted, clean up on the free block list needs to be done and the list might need to be adjusted accordingly.

It's this adjustment that can take time, during the search and replace or removal of the old blocks of memory from the list. If the blocks don't fall on clean boundaries, additional adjustments might need to be done to the list of free heap blocks.

Going deeper into the how and why of Windows heap management would involve explaining the design of the Windows kernel and it's memory management. Getting into that would go beyond the scope of this question/answer but, some of the articles I linked to above have great overviews and explain quite well that how and why.

However, you did ask

how to get the code to run stable on windows?

As stated above, changing the type would allow more consistent timing, additionally, as stated in another answer, deleting the list in reverse order will also achieve more consistent timings;

for (int i = n; i > 0; --i )
{
    delete[] blocks[i-1];
}

This is due to the fact that the Windows memory manger uses a singly linked list to maintain the heap locations, hence why the times can go up when deleteing since the list is being traversed and why the times can be slower on Windows compared to Linux (though my tests actually produced similar times in both doing these changes).

I hope that can help.

Tony Delroy · Answer

Interesting problem. I was able to reproduce.

I get consistent - albeit still somewhat sluggish - performance by delete[]-ing the blocks in reverse order of their allocation:

for( int i = 0; i<n; ++i )
    delete[] blocks[n - 1 - i];

I suspect it may all relate to coalescing overheads - from MSDN here:

Slowdown as a result of free operations. Free operations consume more cycles, mainly if coalescing is enabled. During coalescing, each free operation should "find" its neighbors, pull them out to construct a larger block, and reinsert the larger block into the free list. During that find, memory may be touched in a random order, causing cache misses and performance slowdown.

There are few weird things about it though:

my measurements showed that while delete[] took ~80% of the time on the first iteration or three, after half a dozen more new[] was taking almost as long.
the problem kicked in very suddenly as I went from new long[91134] to ...91135: that's very nearly 356kb, but I didn't manage to google anything related.

Why does repeated memory allocation in windows slow down?

Tags:

c++

windows

user6234011

2 Answers

txtechhelp

Tony Delroy

Recent Activity

Donate For Us

Why does repeated memory allocation in windows slow down?

Tags:

c++

windows

user6234011

2 Answers

txtechhelp

Tony Delroy

Related questions

Recent Activity

Donate For Us