Do memory allocation functions indicate that the memory content is no longer used?

When processing some stream of data, e.g., requests from a network, it is quite common that some temporary memory is used. For example, a URL may be split into multiple strings, each one possibly allocating memory from the heap. The use of these entities is often short-lived and the total amount of memory is often relatively small and should fit into a CPU cache.

At the point the memory used for a temporary string gets released, the string content may very well have lived only within the cache. However, the CPU is unaware of the deallocation: to it, releasing memory is just a bookkeeping update in the memory management system. As a result, the CPU may end up needlessly writing the dead content back to actual memory when the cache lines are repurposed for other data - unless the release somehow tells the CPU that the memory isn't used anymore. Hence, the question becomes:

Do memory management functions releasing memory somehow indicate that the content of the corresponding memory can be discarded? Is there even a way to indicate to the CPU that memory is no longer used (at least on some CPUs; there may, obviously, be differences between architectures)? Since implementations will differ in quality and may or may not do anything fancy, the question really is whether any memory management implementation marks released memory as unused.

I do realize that always using the same arena of memory may be a mitigation strategy to avoid unnecessary writes to actual memory: in that case the same cached memory would be reused. Similarly, the allocator is likely to hand back the same memory each time, which also avoids unnecessary transfers. However, I would prefer not to rely on either of these techniques being applicable.
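
For what it's worth, the arena idea from the last paragraph is straightforward to sketch. The following is a minimal illustration of my own (not taken from any particular allocator): all temporaries for one request come out of one small buffer that is recycled wholesale, so the same few cache-warm kilobytes are touched for every request and nothing is ever handed back to the heap:

```c
#include <stddef.h>

/* Minimal bump arena (illustrative sketch only): temporaries for one
 * request are carved out of a fixed buffer; a single reset recycles it.
 * Reusing the same few KB keeps the lines resident in L1/L2 instead of
 * touching fresh heap memory on every request. */
typedef struct {
    unsigned char buf[4096];
    size_t        used;
} arena;

static void *arena_alloc(arena *a, size_t n)
{
    n = (n + 15) & ~(size_t)15;        /* keep 16-byte alignment */
    if (a->used + n > sizeof a->buf)
        return NULL;                   /* caller falls back to malloc() */
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

static void arena_reset(arena *a) { a->used = 0; }  /* "free" everything */

int main(void)
{
    arena a = { .used = 0 };
    char *host = arena_alloc(&a, 32);  /* e.g. pieces of a split URL */
    char *path = arena_alloc(&a, 64);
    (void)host; (void)path;
    arena_reset(&a);                   /* all temporaries gone at once */
    return 0;
}
```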

asked Dec 02 '15 by Dietmar Kühl

3 Answers

No.

The cache operation you mention (marking cached memory as unused and discarding without writeback to main memory) is called cacheline invalidation without writeback. This is performed through a special instruction with an operand that may (or may not) indicate the address of the cacheline to be invalidated.

On all architectures I'm familiar with, this instruction is privileged, with good reason in my opinion. This means that usermode code cannot employ the instruction; only the kernel can. The amount of perverted trickery, data loss and denial of service that would be possible otherwise is incredible.

As a result, no memory allocator could do what you propose; they simply don't have (in usermode) the tools to do so.

Architectural Support

  • The x86 and x86-64 architectures have the privileged invd instruction, which invalidates all internal caches without writeback and directs external caches to invalidate themselves as well. This is the only instruction capable of invalidating without writeback, and it is a blunt weapon indeed.
    • The non-privileged clflush instruction specifies a victim address, but it writes back before invalidating, so I mention it only in passing.
    • Documentation for all these instructions is in Intel's SDMs, Volume 2.
  • The ARM architecture performs cache invalidation without writeback with a write to coprocessor 15, register 7: MCR p15, 0, <Rd>, c7, <CRm>, <Opcode_2>. A victim cacheline may be specified. Writes to this register are privileged.
  • PowerPC has dcbi, which lets you specify a victim, dci which doesn't and instruction-cache versions of both, but all four are privileged (see page 1400).
  • MIPS has the CACHE instruction, which can specify a victim. It was privileged as of MIPS Instruction Set v5.04, but in 6.04 Imagination Technologies muddied the waters and it's no longer clear what is privileged and what isn't.

So this rules out the use of cache invalidation without flushing/writing back in usermode outright.
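
As an aside, the non-privileged clflush mentioned above is easy to experiment with from usermode, and it also demonstrates the cost of the "guaranteed cache miss" discussed below. A rough sketch for x86-64 with GCC/Clang (the timings are noisy and purely illustrative):

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>  /* _mm_clflush, _mm_mfence, __rdtsc */

/* Time a single read of *p in TSC ticks. Crude: fences only, no full
 * serialization, so treat the numbers as illustrative. */
static uint64_t timed_read(volatile int *p)
{
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    (void)*p;
    _mm_mfence();
    return __rdtsc() - t0;
}

int main(void)
{
    static int x = 42;

    timed_read(&x);                 /* warm-up: pull the line into L1 */
    uint64_t hot = timed_read(&x);  /* should hit in cache */

    _mm_clflush((const void *)&x);  /* write back, then invalidate */
    _mm_mfence();
    uint64_t cold = timed_read(&x); /* guaranteed miss: refetch from RAM */

    printf("cached: %llu ticks, after clflush: %llu ticks\n",
           (unsigned long long)hot, (unsigned long long)cold);
    return 0;
}
```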

Kernel mode?

However, I'd argue that it's still a bad idea in kernelmode for numerous reasons:

  • Linux's allocator, kmalloc(), allocates out of arenas for different sizes of allocations. In particular, it has an arena for each allocation size <= 192 bytes in steps of 8; this means that objects can potentially sit closer to each other than a cacheline, or partially overlap the next one, and invalidation could thus blow out nearby objects that were rightly in cache and not yet written back. This is wrong. (A usermode illustration of this sharing follows the list.)
    • The problem is compounded by the fact that cachelines may be quite large (64 bytes on x86-64) and, moreover, are not necessarily uniform in size throughout the cache hierarchy. For instance, the Pentium 4 had 64 B L1 cachelines but 128 B L2 cachelines.
  • It makes the deallocation time linear in the number of cachelines of the object being deallocated.
  • It has a very limited benefit: L1 caches are only tens of KB, so a 32 KB L1 with 64-byte lines holds just 512 lines, and a few hundred flushes empty it entirely. Moreover, the cache may already have written the data back without your prompting, in which case your invalidation is worse than useless: the memory bandwidth was spent, yet you no longer have the line in cache, so the next partial write to it will require a refetch.
  • The next time the memory allocator returns that block, which might be soon, its user will suffer a guaranteed cache miss and a fetch from main RAM, where he could have had a dirty unflushed line or a clean flushed line instead. The cost of a guaranteed cache miss and fetch from main RAM is much greater than that of a cacheline writeback (without invalidation) tucked in somewhere, automatically and intelligently, by the caching hardware.
  • The additional code required to loop over and flush these lines wastes instruction cache space.
  • A better use for the dozens of cycles taken by the aforementioned loop would be to keep doing useful work, while letting the cache and memory subsystem's considerable bandwidth write back your dirty cachelines for you.
    • My modern Haswell processor has 32 bytes/clock of L1 write bandwidth and 25 GB/s of main RAM bandwidth. I'm sure a few extra 64-byte cacheline writebacks can be squeezed in somewhere in there.
  • Lastly, for short-lived, small allocations like these, there's the option of allocating on the stack.
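
To make the first bullet above concrete, here is a small usermode sketch (my own illustration; the hard-coded 64 is an assumed line size - real code should query it) showing that two adjacent small heap allocations often land on the same cacheline, so invalidating the freed object's line would also destroy a live neighbour:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE 64  /* assumed cacheline size; query the CPU on real systems */

/* Base address of the cacheline containing p. */
static uintptr_t line_of(const void *p)
{
    return (uintptr_t)p & ~(uintptr_t)(LINE - 1);
}

int main(void)
{
    /* Two small, independently freeable objects from the heap. */
    char *a = malloc(8);
    char *b = malloc(8);
    if (!a || !b)
        return 1;

    printf("a = %p (line %#llx)\nb = %p (line %#llx)\n",
           (void *)a, (unsigned long long)line_of(a),
           (void *)b, (unsigned long long)line_of(b));

    /* With typical allocators the two 8-byte objects sit a few tens of
     * bytes apart, i.e. frequently on one and the same cacheline. */
    if (line_of(a) == line_of(b))
        puts("same line: invalidating a's cacheline would clobber b");

    free(a);
    free(b);
    return 0;
}
```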

Actual memory allocator practice

  • The famed dlmalloc does not invalidate freed memory.
  • glibc does not invalidate freed memory.
  • jemalloc does not invalidate freed memory.
  • musl-libc's malloc() does not invalidate freed memory.

None of them invalidate memory, because they can't: doing a system call just to invalidate cachelines would be incredibly slow and would cause far more traffic in and out of the cache, simply because of the context switch.

answered Sep 24 '22 by Iwillnotexist Idonotexist

I am not aware of any architecture that would willingly expose its cache coherence protocols to this kind of software (user or even kernel) manipulation. It would create caveats that are practically impossible to handle. Note that user-initiated flushing is an acceptable exposure, since it in no way threatens to break memory coherency.

Just as an example, imagine you have a cacheline holding temporary data that you no longer need. Since it was written to, it would be in the "modified" state in the cache. Now you want a mechanism that tells the cache to avoid writing it back, but that means you create a race condition: if someone else were to look for the line before you applied this trick, he would have snooped it out of the core and received the updated data; if your core had its way first, the new data would be lost. The eventual content of that address in memory therefore depends on a race.

You might argue that such races are common in multithreaded programming anyway, but this scenario can also occur while running a single thread (the CPU may voluntarily evict the line earlier if the cache is full, or a lower inclusive cache level may lose it). Worse, it breaks the premise that all of virtual memory appears flat and that cached copies are maintained by the CPU purely for performance, never able to break coherency or consistency (except in some documented multithreaded cases, depending on the memory ordering model, which can be overcome by software protection).

Edit: If you are willing to extend the definition of what you consider as "memory", you could seek non-coherent types of memory, which differ in definition and implementation but some may provide what you seek. Some architectures expose "scratchpad" memory, which is controlled by the user and allows fast access without the hassle of cache coherency (but also without its benefits). Some architectures even go as far as providing configurable hardware that allows you to select whether you prefer to cache main memory in it, or to use it as a scratchpad area.
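
As a heavily toolchain-dependent illustration of the scratchpad idea: on parts with tightly-coupled or scratchpad memory, placing a buffer there is often just a linker-section annotation. The section name below is made up - the real one comes from your part's linker script or BSP:

```c
/* Sketch only: ".scratchpad" is a hypothetical section name. On a hosted
 * system this still compiles, but the array behaves like ordinary data;
 * on a part with real scratchpad/TCM, the linker script maps the section
 * into that memory, giving fast access with no cache coherency traffic. */
#if defined(__GNUC__)
__attribute__((used, section(".scratchpad")))
#endif
static unsigned char scratch[4096];

unsigned char *get_scratch(void) { return scratch; }
```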

answered Sep 26 '22 by Leeor

This is very much a matter of the implementation and of the library you are using. Allocated and freed memory tends to be reallocated very quickly. Most allocations are small blocks, much smaller than a page, which is the unit that would be written to backing storage when needed.

And today, RAM sizes are typically so large that by the time the OS starts writing dirty pages to backing store you are in trouble no matter what. With 16 GB of RAM you won't be writing hundreds of kilobytes or a megabyte; you will be writing gigabytes, and your computer will slow to a crawl. Users avoid that situation by not running applications that use too much memory.

answered Sep 23 '22 by gnasher729