When processing a stream of data, e.g., requests from a network, it is quite common for some temporary memory to be used. For example, a URL may be split into multiple strings, each possibly allocating memory from the heap. The use of these entities is often short-lived, and the total amount of memory is often relatively small and should fit into a CPU cache.
At the point the memory used for a temporary string is released, the string content may very well have lived only within the cache. However, the CPU is unaware of the memory being deallocated: the deallocation is just an update in the memory management system. As a result, the CPU may end up writing the unused content to actual memory unnecessarily when the cache is needed for other memory, unless the release somehow indicates to the CPU that the memory isn't used anymore. Hence, the question becomes:
Do memory management functions that release memory somehow indicate that the content of the corresponding memory can be discarded? Is there even a way to indicate to the CPU that memory is no longer used, at least on some CPUs (there may, obviously, be differences between architectures)? Since different implementations will likely differ in quality and may or may not do anything fancy, the question really is whether there is any memory management implementation that marks released memory as unused.
I do realize that always using the same arena of memory may be a mitigation strategy to avoid unnecessary writes to actual memory: in that case the same cached memory would be reused. Similarly, it is likely that memory allocation repeatedly yields the same memory, also avoiding unnecessary memory transfers. However, I would prefer not to rely on either of these techniques being applicable.
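For concreteness, the kind of arena reuse I mean could look like this minimal sketch (the names and sizes are hypothetical, not from any particular library); by resetting and reusing a few kilobytes, the working set stays cache-resident across requests:

```c
#include <stddef.h>

/* Hypothetical fixed-size arena, small enough to stay cache-resident. */
static _Alignas(64) char arena[4096];
static size_t arena_used;

/* Bump allocation: hand out the next aligned chunk, no heap traffic. */
static void *arena_alloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;            /* round up to 8-byte alignment */
    if (n > sizeof arena - arena_used)
        return NULL;                     /* arena exhausted */
    void *p = arena + arena_used;
    arena_used += n;
    return p;
}

/* Reset between requests: the next request reuses the same
 * (by now likely cached) lines instead of touching new memory. */
static void arena_reset(void) { arena_used = 0; }
```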
The cache operation you mention (marking cached memory as unused and discarding without writeback to main memory) is called cacheline invalidation without writeback. This is performed through a special instruction with an operand that may (or may not) indicate the address of the cacheline to be invalidated.
On all architectures I'm familiar with, this instruction is privileged, with good reason in my opinion. This means that usermode code cannot employ the instruction; only the kernel can. The amount of perverted trickery, data loss and denial of service that would be possible otherwise is incredible.
As a result, no memory allocator could do what you propose; they simply don't have (in usermode) the tools to do so.
A quick tour of the relevant instructions across architectures:

- x86 has the `invd` instruction, which invalidates all internal caches without writeback and directs external caches to invalidate themselves also. This is the only instruction capable of invalidating without writeback, and it is a blunt weapon indeed. The `clflush` instruction specifies a victim address, but it writes back before invalidating, so I mention it only in passing.
- ARM performs cache control through writes to the system control coprocessor: `MCR p15, 0, <Rd>, c7, <CRm>, <Opcode_2>`. A victim cacheline may be specified. Writes to this register are privileged.
- PowerPC gives you `dcbi`, which lets you specify a victim, `dci`, which doesn't, and instruction-cache versions of both, but all four are privileged (see page 1400).
- MIPS has the `CACHE` instruction, which can specify a victim. It was privileged as of MIPS Instruction Set v5.04, but in 6.04 Imagination Technologies muddied the water and it's no longer clear what is privileged and what isn't.

So this excludes the use of cache invalidation without flushing/writing back in usermode outright.
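By contrast, the write-back flush is available to usermode on x86. As a hedged sketch of what unprivileged code can actually do, here is the `_mm_clflush` intrinsic applied to a buffer (a 64-byte line size is assumed here); note it writes dirty data back before invalidating, so it pays memory traffic rather than saving it:

```c
#include <immintrin.h>   /* _mm_clflush, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Write back and invalidate every cacheline covering buf[0..len).
 * This is a flush, not a discard: nothing can be lost, but the
 * write-back traffic the question hopes to avoid is still paid. */
static void flush_range(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)63;  /* align down to line */
    uintptr_t end = (uintptr_t)buf + len;
    for (; p < end; p += 64)
        _mm_clflush((const void *)p);
    _mm_sfence();   /* conservatively order the flushes with later stores */
}
```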
However, I'd argue that it's still a bad idea in kernelmode for numerous reasons:
- The Linux kernel's memory allocator, `kmalloc()`, allocates out of arenas for different sizes of allocations. In particular, it has an arena for each allocation size <= 192 bytes, in steps of 8. This means that objects can potentially be closer to each other than a cacheline, or partially overlap the next one, and using invalidation could thus blow out nearby objects that were rightly in cache and not yet written back. This is wrong. (A userspace illustration of this sub-cacheline sharing follows below.)
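To illustrate, here is a quick userspace sketch with ordinary `malloc()` (the layout is allocator-dependent, so the two blocks may or may not share a line on any given run; 64-byte lines are assumed):

```c
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE 64   /* assumed cacheline size */

int main(void)
{
    /* Two small neighbouring allocations, like a slab-style allocator
     * would hand out for sub-cacheline sizes. */
    char *a = malloc(8);
    char *b = malloc(8);

    /* If both print the same line index, invalidating "a"'s line
     * without writeback would also discard pending stores to "b". */
    printf("a is on line %" PRIuPTR "\n", (uintptr_t)a / LINE);
    printf("b is on line %" PRIuPTR "\n", (uintptr_t)b / LINE);

    free(a);
    free(b);
    return 0;
}
```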
As for the common usermode allocators:

- `glibc` does not invalidate freed memory.
- `jemalloc` does not invalidate freed memory.
- `musl`-libc's `malloc()` does not invalidate freed memory.

None of them invalidate memory, because they can't. Doing a system call for the sake of invalidating cachelines would be both incredibly slow and would cause far more traffic in and out of the cache, just because of the context switch.
I am not aware of any architecture that would willingly expose its cache coherence protocols to software (user or even kernel) manipulation like this. This would create caveats that are practically impossible to handle. Note that user-initiated flushing is an acceptable exposure but in no way threatens to break memory coherency.
Just as an example, imagine you have a cache line holding temporary data you no longer need. Since it was written to, it would be in the "modified" state in the cache. Now you want a mechanism that tells the cache to avoid writing it back, but that creates a race condition: if someone else were to look for the line before you applied this trick, they would have snooped it out of the core and received the updated data, whereas if your core had its way first, the new data would have been lost. The outcome at that address in memory therefore depends on a race.
You might argue that in multithreaded programming that is often the case, but this scenario may also occur when running a single thread (the CPU may voluntarily evict the line earlier if the cache is full, or some lower inclusive cache level may lose it). Worse, this breaks the premise that virtual memory appears flat, and that cached copies are maintained by the CPU purely for performance and cannot break coherency or consistency (except in some documented multithreaded cases that depend on the memory ordering model, and which can be overcome by software protection).
Edit: If you are willing to extend the definition of what you consider "memory", you could look to non-coherent types of memory, which differ in definition and implementation, but some of which may provide what you seek. Some architectures expose "scratchpad" memory, which is controlled by the user and allows fast access without the hassle of cache coherency (but also without its benefits). Some architectures even go as far as providing configurable hardware that lets you select whether to use it for caching main memory or as a scratchpad area.
This is very much a matter of the implementation and of the library you are using. Allocated and freed memory tends to be reallocated very quickly, and most allocations are small blocks, much smaller than a page, which is the unit that would be written to backing storage when needed.
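A quick way to observe this reuse, allocator-dependent and therefore only suggestive, is to free a block and immediately request the same size again; with glibc and most other allocators the same, still-cached address typically comes back:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *a = malloc(64);
    printf("first  allocation: %p\n", a);
    free(a);

    /* Most allocators keep small freed blocks on a per-size free list,
     * so the very next same-size request usually returns the same block,
     * whose cachelines are still hot. Not guaranteed by the C standard. */
    void *b = malloc(64);
    printf("second allocation: %p\n", b);
    free(b);
    return 0;
}
```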
And today, RAM sizes are typically so large that by the time the OS starts writing dirty pages to backing store, you are in trouble no matter what. If you have 16 GB of RAM, you won't be writing a hundred kilobytes or a megabyte; you will be writing gigabytes, and your computer will slow to a crawl. The user avoids the situation by not using applications that need too much memory.