Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why deallocating heap memory is much slower than allocating it?

This is an empirical assumption (that allocating is faster then de-allocating).

This is also one of the reason, i guess, why heap based storages (like STL containers or else) choose to not return currently unused memory to the system (that is why shrink-to-fit idiom was born).

And we shouldn't confuse, of course, 'heap' memory with the 'heap'-like data structures.


So why de-allocation is slower?

Is it Windows-specific (i see it on Win 8.1) or OS independent?

Is there some C++ specific memory manager automatically involved on using 'new' / 'delete' or the whole mem. management is completely relies on the OS? (i know C++11 introduced some garbage-collection support, which i never used really, better relying on the old stack and static duration or self managed containers and RAII).

Also, in the code of the FOLLY string i saw using old C heap allocation / deallocation, is it faster then C++ 'new' / 'delete'?


P. S. please note that the question is not about virtual memory mechanics, i understand that user-space programs didn't use real mem. addresation.

like image 866
Amabo Carab Avatar asked Jun 25 '16 16:06

Amabo Carab


2 Answers

The assertion that allocating memory is faster than deallocating it seemed a bit odd to me, so I tested it. I ran a test where I allocated 64MB of memory in 32-byte chunks (so 2M calls to new), and I tried deleting that memory in the same order it was allocated, and in a random order. I found that linear-order deallocation was about 3% faster than allocation, and that random deallocation was about 10% slower than linear allocation.

I then ran a test where I started with 64MB of allocated memory, and then 2M times either allocated new memory or deleted existing memory (at random). Here, I found that deallocation was about 4.3% slower than allocation.

So, it turns out you were correct - deallocation is slower than allocation (though I wouldn't call it "much" slower). I suspect this has simply to do with more random accesses, but I have no evidence for this other than that the linear deallocation was faster.

To answer some of your questions:

Is there some C++ specific memory manager automatically involved on using 'new' / 'delete'?

Yes. The OS has system calls which allocate pages of memory (typically 4KB chunks) to processes. It's the process' job to divide up those pages into objects. Try looking up the "GNU Memory Allocator."

I saw using old C heap allocation / deallocation, is it faster then C++ 'new' / 'delete'?

Most C++ new/delete implementations just call malloc and free under the hood. This is not required by the standard, however, so it's a good idea to always use the same allocation and deallocation function on any particular object.

I ran my tests with the native testing framework provided in Visual Studio 2015, on a Windows 10 64-bit machine (The tests were also 64-bit). Here's the code:

#include "stdafx.h"
#include "CppUnitTest.h"

using namespace Microsoft::VisualStudio::CppUnitTestFramework;

namespace AllocationSpeedTest
{       
    class Obj32 {
        uint64_t a;
        uint64_t b;
        uint64_t c;
        uint64_t d;
    };
    constexpr int len = 1024 * 1024 * 2;
    Obj32* ptrs[len];
    TEST_CLASS(UnitTest1)
    {
    public:
        TEST_METHOD(Linear32Alloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
        }
        TEST_METHOD(Linear32AllocDealloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            for (int i = 0; i < len; ++i) {
                delete ptrs[i];
            }
        }
        TEST_METHOD(Random32AllocShuffle)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                int pos = (rand() % (len - i)) + i;
                Obj32* temp = ptrs[i];
                ptrs[i] = ptrs[pos];
                ptrs[pos] = temp;
            }
        }
        TEST_METHOD(Random32AllocShuffleDealloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                int pos = (rand() % (len - i)) + i;
                Obj32* temp = ptrs[i];
                ptrs[i] = ptrs[pos];
                ptrs[pos] = temp;
            }
            for (int i = 0; i < len; ++i) {
                delete ptrs[i];
            }
        }
        TEST_METHOD(Mixed32Both)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    ptrs[i] = new Obj32();
                }
                else {
                    delete ptrs[i];
                }
            }
        }
        TEST_METHOD(Mixed32Alloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    ptrs[i] = new Obj32();
                }
                else {
                    //delete ptrs[i];
                }
            }
        }
        TEST_METHOD(Mixed32Dealloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    //ptrs[i] = new Obj32();
                }
                else {
                    delete ptrs[i];
                }
            }
        }
        TEST_METHOD(Mixed32Neither)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    //ptrs[i] = new Obj32();
                }
                else {
                    //delete ptrs[i];
                }
            }
        }
    };
}

And here are the raw results over several runs. All numbers are in milliseconds. Table of raw results

like image 154
IanPudney Avatar answered Oct 14 '22 14:10

IanPudney


I had much the same idea as @Basile: I wondered whether your base assumption was actually (even close to) correct. Since you tagged the question C++, I wrote a quick benchmark in C++ instead.

#include <vector>
#include <iostream>
#include <numeric>
#include <chrono>
#include <iomanip>
#include <locale>

int main() {
    std::cout.imbue(std::locale(""));

    using namespace std::chrono;
    using factor = microseconds;

    auto const size = 2000;

    std::vector<int *> allocs(size);

    auto start = high_resolution_clock::now();

    for (int i = 0; i < size; i++)
        allocs[i] = new int[size];

    auto stop = high_resolution_clock::now();
    auto alloc_time = duration_cast<factor>(stop - start).count();

    start = high_resolution_clock::now();

    for (int i = 0; i < size; i++)
        delete[] allocs[i];

    stop = high_resolution_clock::now();

    auto del_time = duration_cast<factor>(stop - start).count();

    std::cout << std::left << std::setw(20) << "alloc time: " << alloc_time << " uS\n";
    std::cout << std::left << std::setw(20) << "del time: " << del_time << " uS\n";
}

I also used VC++ on Windows instead of gcc on Linux. The result wasn't much different though: freeing the memory took substantially less time than allocating it did. Here are the results from three successive runs.

alloc time:         2,381 uS
del time:           1,429 uS

alloc time:         2,764 uS
del time:           1,592 uS

alloc time:         2,492 uS
del time:           1,442 uS

I'd warn, however, allocation and freeing is handled (primarily) by the standard library, so this could be different between one standard library and another (even when using the same compiler). I'd also note that it wouldn't surprise me if this were to change somewhat in multi-threaded code. Although it's not actually correct, there appear to be a few authors who are under the mis-apprehension that freeing in a multithreaded environment requires locking a heap for exclusive access. This can be avoided, but the means to do so isn't necessarily immediately obvious.

like image 43
Jerry Coffin Avatar answered Oct 14 '22 14:10

Jerry Coffin