 

If memory pools are faster than malloc, why doesn't malloc use them under the covers?

I keep hearing that memory pools can significantly improve performance when allocating memory, so why aren't they used in some way by traditional malloc implementations?

I know part of it is that memory pools use fixed-size blocks of memory, but it seems like some don't, and the only thing they require is grabbing a little extra memory in advance. Is there a way they could be sufficiently generalized for such purposes?

asked May 08 '20 by mczarnek


2 Answers

Memory pools can be more efficient than generalised memory allocation, but usually only because you have extra information about the allocation patterns. Perhaps their most important property is that they are deterministic in their run time, especially important in, for example, real-time operating systems.

As an example, I once coded up an embedded system where I knew the largest allocation needed was 128 bytes (called a chunk hereafter). To that end, I maintained a contiguous set of chunks and used a map for deciding whether one was free or not.

It was originally a bitmap but we ended up getting more performance simply by storing each used/unused flag in a separate byte. Eight times as much memory usage for the map but, since the pool size was known and reasonably limited (a thousand or so), that wasn't too bad. And it gave us even more speed by not having to do bit twiddling to do pool management.

And there were other optimisations we added, like storing the lowest free block so we could quickly find it. It was easy to maintain since:

  • freeing a block lower than the current lowest would simply update the lowest; and
  • allocating the lowest block would simply increment the lowest - while that didn't guarantee it pointed at a free block, it would still make the search faster, and avoided a possibly unnecessary search on allocation (if, for example, you first freed a block lower than the one you just allocated).

Then, if you asked for more than the chunk size, it returned NULL (this never happened in that system but, being paranoid, I coded for it just in case). If you asked for something that would fit in a chunk, you got a full chunk regardless (but, of course, you should still only use the memory you asked for, just in case I wanted to later change the chunk size or allocate from separate pools with differing chunk sizes).
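
A minimal sketch of that scheme in C might look like the following (the names, the map-plus-lowest-free layout, and the 128-byte/1024-chunk figures are just illustrative, not the original embedded code):

```c
#include <stddef.h>

#define CHUNK_SIZE 128   /* largest allocation the system ever needed */
#define NUM_CHUNKS 1024  /* "a thousand or so"                        */

static unsigned char pool[NUM_CHUNKS][CHUNK_SIZE]; /* contiguous chunks */
static unsigned char used[NUM_CHUNKS];             /* one byte per flag */
static size_t lowest_free = 0;                     /* search-start hint */

void *pool_alloc(size_t size) {
    if (size > CHUNK_SIZE)                /* bigger than a chunk: NULL  */
        return NULL;
    for (size_t i = lowest_free; i < NUM_CHUNKS; i++) {
        if (!used[i]) {
            used[i] = 1;
            lowest_free = i + 1;          /* just a hint, may be in use */
            return pool[i];               /* a full chunk, regardless   */
        }
    }
    return NULL;                          /* pool exhausted             */
}

void pool_free(void *ptr) {
    size_t i = ((unsigned char *)ptr - &pool[0][0]) / CHUNK_SIZE;
    used[i] = 0;
    if (i < lowest_free)                  /* freed below the hint?      */
        lowest_free = i;                  /* then lower the hint        */
}
```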

This turned out to be much faster than the generalised allocators that were around at the time, since they had to handle differing request sizes and worry about things like coalescing contiguous free blocks when freeing memory.

But it required that extra knowledge, the fact that no allocation would ever be more than the chunk size.


Another model is to have a pool for requests below a certain size but to revert to general allocation if either:

  • you requested a block beyond the chunk size; or
  • the pool was currently exhausted.

That gives you the extra efficiency in most cases (depending on your allocation patterns, of course) while still allowing for allocations beyond the chunk size. It introduces a little extra effort in every allocation, in that you need to evaluate the request size and check for pool exhaustion, but it may still out-perform the general case.
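
In rough C, assuming the hypothetical pool_alloc/pool_free functions and pool array from the sketch above, that hybrid model could look like this:

```c
#include <stdlib.h>

/* Pool first; general allocator when the request is too big for a
   chunk or the pool is exhausted. */
void *hybrid_alloc(size_t size) {
    if (size <= CHUNK_SIZE) {
        void *p = pool_alloc(size);
        if (p != NULL)
            return p;                 /* fast path: served by the pool */
    }
    return malloc(size);              /* slow path: general allocator  */
}

void hybrid_free(void *ptr) {
    unsigned char *p = (unsigned char *)ptr;
    /* An address inside the pool's range came from the pool. */
    if (p >= &pool[0][0] && p < &pool[0][0] + sizeof pool)
        pool_free(ptr);
    else
        free(ptr);
}
```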

As an aside, I recall something similar in Java strings (not sure if this is still the case, I haven't used Java for quite a while). The string object had a buffer inside it for storing small strings, but could also use that space to store a pointer to a separately allocated chunk of characters (if the string was bigger than the internal buffer). This reduced fragmentation (and dereferencing) for what was possibly a large number of small strings, but still allowed for larger strings if needed.
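
The same trick is often called small-buffer (or small-string) optimisation; a purely hypothetical C layout, not how Java actually implements it, might be:

```c
#include <stddef.h>

#define SSO_CAP 15   /* characters that fit in the inline buffer */

typedef struct {
    size_t len;
    union {
        char  small[SSO_CAP + 1];  /* short strings stored inline      */
        char *big;                 /* long ones in a separate block    */
    } data;
} sso_string;

/* Short strings need no second allocation (and no dereference). */
static int is_small(const sso_string *s) {
    return s->len <= SSO_CAP;
}

const char *sso_cstr(const sso_string *s) {
    return is_small(s) ? s->data.small : s->data.big;
}
```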


Interestingly, I once tried an experiment in the CPython source code to see if a memory pool could improve performance, especially given the quantity of memory allocations that goes on in there. It used a strategy similar to the one given above, allocating from the pool in preference, but reverting to the original strategy if the requested size was beyond the chunk size or if the pool was exhausted.

Again, it had the optimisations discussed above, and then some. For example, the last block freed was cached so it could be handed out immediately without any searching of the pool, in an attempt to speed up the very common free-then-allocate pattern.
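
A toy version of that cache, again reusing the hypothetical pool_alloc/pool_free from the first sketch:

```c
/* Remember the last block freed so a free-then-allocate pair needs
   no pool search at all. */
static void *last_freed = NULL;

void *cached_alloc(size_t size) {
    if (size <= CHUNK_SIZE && last_freed != NULL) {
        void *p = last_freed;     /* cache hit: hand it straight back  */
        last_freed = NULL;
        return p;
    }
    return pool_alloc(size);
}

void cached_free(void *ptr) {     /* ptr is assumed to be a pool block */
    if (last_freed != NULL)
        pool_free(last_freed);    /* evict the previous cache entry    */
    last_freed = ptr;             /* keep this one for the next alloc  */
}
```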

However, even with a variety of optimisations, pool and chunk sizes, it appeared to make no substantial difference to the performance of some test code I wrote to exercise it, leading me to believe that the actual allocator used in CPython is already pretty good.


And, just having finished reading the excellent book I purchased a couple of weeks ago (a), I now know why I didn't make any headway.

Turns out CPython is already heavily optimised, including the use of memory pools. The "Memory Management" chapter goes into a lot more detail, but it basically only uses the normal allocator (raw domain) for requests larger than the biggest block size (512 bytes) or for specific non-object-related memory.

All objects, and Python is pretty much all objects :-), come from the object domain (other than a few legacy things).

For this domain, it maintains its own heap and allocates 256 KB arenas, using mmap if supported to reduce fragmentation. All the used arenas are held in a doubly-linked list; the empty ones are maintained in a singly-linked free list.

Within each arena, 4 KB pools are created, sized to match the system page size (so 64 per arena), and a pool can only serve one size of allocation, locked in when the first allocation is requested from that pool. For example, requests for 1-16 bytes will get a 16-byte block from a pool serving 16-byte blocks, and 33-48 byte requests will come from a pool serving 48-byte blocks.

Note this is for a 64-bit system where the block sizes are {16, 32, 48, ..., 512}. 32-bit systems have a slightly different set of block sizes, {8, 16, 24, 32, ..., 512}.
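
The rounding itself is just arithmetic; a sketch for the 64-bit case (the real logic lives in CPython's Objects/obmalloc.c, this is only the idea):

```c
#include <stddef.h>

#define ALIGNMENT 16   /* size-class step on 64-bit; 8 on 32-bit */

/* Round a request (>= 1) up to the smallest block size holding it:
   1-16 -> 16, 17-32 -> 32, 33-48 -> 48, and so on. */
size_t block_size(size_t request) {
    return (request + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}

/* Index of the pool list serving that block size: 16 -> 0, 32 -> 1... */
size_t size_class_index(size_t request) {
    return (request - 1) / ALIGNMENT;
}
```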

Pools within an arena are in one of three states:

  • partially used, in which case they live on a doubly-linked list in the arena, based on their block size. The arena maintains free lists for the pools, one per block size.
  • unused, in which case they live on a free list of pools, able to serve any request size (though, once locked in, that's the block size they're limited to).
  • full, in which case they are made inaccessible, except for de-allocation.

Keep in mind that transitions between those three states all result in pools moving from list to list.

I won't go into any more detail since your head may explode, like mine almost did :-)

In short, CPython object allocations are always for a specific block size, the smallest one larger or equal to what you need. These come from pools that serve a single block size (once locked in). These pools exist in arenas heavily optimised for preventing fragmentation. And arenas can be created as needed.

Suffice to say, this is why my little experiment didn't improve CPython: it already does memory pooling in a quite complex, yet efficient, way and my attempt to intercept malloc did no good at all.

And my opening statement that pooled memory can be more efficient "but usually only because you have extra information about the allocation patterns" is supported by a comment from that book:

Most of the memory allocation requests are small and of a fixed size, because PyObject is 16 bytes, PyASCIIObject is 42 bytes, PyCompactUnicodeObject is 72 bytes, and PyLongObject is 32 bytes.


(a) CPython Internals, if you're interested; I have no affiliation other than the fact I like good tech books about how things work.

answered Oct 10 '22 by paxdiablo


As usual, it all depends.
In this case it depends mainly on what you mean by performance.

As others have already said, memory pools are usually faster than standard malloc and free implementations, but speed is not the only factor one must consider in the general case. A general allocator should not allocate too much memory before it's actually needed (which pools usually do), and it should allocate blocks of arbitrary size (which pools usually do not).

Memory pools are more efficient at allocating large numbers of small blocks, especially ones of the same size, because the blocks can be allocated as an array, so a whole bunch of blocks shares a single common header instead of each block carrying its own.
On the other hand, usually not the whole bunch gets used up, which may be seen as a loss of memory efficiency.

Pools can be faster at malloc/new because they can almost instantly hand you a block from a pre-allocated array of blocks of a fitting size, instead of searching for a fitting block to slice off. However, you can't have a pool for every possible size, so you usually get a block a bit larger than needed, which again is some loss of memory.

They can also be faster at free/delete, because they just mark a block in the pool as freed and don't need to search for adjacent blocks to see whether they're free too, and 'glue' (coalesce) the newly freed block onto them.
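
A sketch of why that is O(1): each free block can store the pointer to the next free block inside itself, so both allocating and freeing are a single push or pop on an intrusive free list (illustrative C, not any particular allocator's code):

```c
#include <stddef.h>

/* Free blocks double as list nodes: no headers, no searching. */
typedef union block {
    union block  *next_free;     /* valid only while the block is free */
    unsigned char payload[64];   /* the pool's fixed block size        */
} block;

static block  storage[1024];     /* the pre-allocated pool             */
static block *free_list = NULL;

void fixed_pool_init(void) {     /* chain every block onto the list    */
    for (size_t i = 0; i + 1 < 1024; i++)
        storage[i].next_free = &storage[i + 1];
    storage[1023].next_free = NULL;
    free_list = &storage[0];
}

void *fixed_alloc(void) {        /* O(1): pop the head of the list     */
    block *b = free_list;
    if (b != NULL)
        free_list = b->next_free;
    return b;                    /* NULL when the pool is exhausted    */
}

void fixed_free(void *ptr) {     /* O(1): push back, no coalescing     */
    block *b = (block *)ptr;
    b->next_free = free_list;
    free_list = b;
}
```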

answered Oct 10 '22 by CiaPan