 

Is it possible to allocate, in user space, a non cacheable block of memory on Linux?

I have a bunch of buffers (25 to 30 of them) in my application that are fairly large (0.5 MB each) and accessed simultaneously. To make it even worse, the data in them is generally only read once, and it is updated frequently (around 30 times per second). Sort of the perfect storm of non-optimal cache use.

Anyhow, it occurred to me that it would be cool if I could mark a block of memory as non-cacheable... Theoretically, this would leave more room in the cache for everything else.

So, is there a way to get a block of memory marked as non-cacheable in Linux?

asked May 20 '09 by dicroce



2 Answers

How to avoid polluting the caches with data like this is covered in What Every Programmer Should Know About Memory (PDF). It is written from the perspective of Red Hat development, so it should be a good fit for you, and most of it is cross-platform anyway.

What you want is called "non-temporal access": it tells the processor to expect that the value you are reading now will not be needed again for a while, so the processor avoids caching that value.

See page 49 of the PDF linked above. It uses the Intel intrinsics to stream data around the cache.

On the read side, processors, until recently, lacked support aside from weak hints using non-temporal access (NTA) prefetch instructions. There is no equivalent to write-combining for reads, which is especially bad for uncacheable memory such as memory-mapped I/O. Intel, with the SSE4.1 extensions, introduced NTA loads. They are implemented using a small number of streaming load buffers; each buffer contains a cache line. The first movntdqa instruction for a given cache line will load a cache line into a buffer, possibly replacing another cache line. Subsequent 16-byte aligned accesses to the same cache line will be serviced from the load buffer at little cost. Unless there are other reasons to do so, the cache line will not be loaded into a cache, thus enabling the loading of large amounts of memory without polluting the caches. The compiler provides an intrinsic for this instruction:

#include <smmintrin.h>
__m128i _mm_stream_load_si128 (__m128i *p); 

This intrinsic should be used multiple times, with addresses of 16-byte blocks passed as the parameter, until each cache line is read. Only then should the next cache line be started. Since there are a few streaming read buffers it might be possible to read from two memory locations at once.
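
As a rough illustration (my own sketch, not from the paper), a streaming read loop over one of those buffers might look like the following. It assumes the buffer holds 32-bit integers, is 16-byte aligned, and has a size that is a multiple of 64 bytes; build with SSE4.1 enabled (e.g. gcc -msse4.1).

#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 */
#include <stdint.h>
#include <stddef.h>

/* Sum a buffer of 32-bit integers using non-temporal (streaming) loads,
   so the data does not displace other lines in the cache.
   Assumes buf is 16-byte aligned and bytes is a multiple of 64. */
static uint32_t sum_streaming(const void *buf, size_t bytes)
{
    const __m128i *p = (const __m128i *)buf;
    __m128i acc = _mm_setzero_si128();

    for (size_t i = 0; i < bytes / 16; i += 4) {
        /* One 64-byte cache line = four 16-byte streaming loads. */
        acc = _mm_add_epi32(acc, _mm_stream_load_si128((__m128i *)&p[i + 0]));
        acc = _mm_add_epi32(acc, _mm_stream_load_si128((__m128i *)&p[i + 1]));
        acc = _mm_add_epi32(acc, _mm_stream_load_si128((__m128i *)&p[i + 2]));
        acc = _mm_add_epi32(acc, _mm_stream_load_si128((__m128i *)&p[i + 3]));
    }

    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);   /* fold the four 32-bit lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}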

This would be perfect for you if, when reading, the buffers are read in linear order through memory; you can use streaming reads to do so. When you want to modify them, modify the buffers in linear order as well, and you can use streaming writes to do that if you don't expect to read them again any time soon from the same thread.
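
On the write side, a corresponding sketch (again my own, under the same alignment assumptions) uses the SSE2 non-temporal store intrinsic _mm_stream_si128; the _mm_sfence at the end orders the streamed stores before any later accesses that depend on them.

#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_set1_epi32; also pulls in _mm_sfence */
#include <stddef.h>

/* Overwrite a buffer with non-temporal stores so the new data goes to
   memory through write-combining buffers without filling the cache.
   Assumes buf is 16-byte aligned and bytes is a multiple of 16. */
static void fill_streaming(void *buf, size_t bytes, int value)
{
    __m128i *p = (__m128i *)buf;
    __m128i v = _mm_set1_epi32(value);

    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(&p[i], v);   /* cache-bypassing store */

    _mm_sfence();                     /* drain the write-combining buffers */
}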

answered Oct 27 '22 by Tom Leys


Frequently updated data actually is the perfect application of cache. As jdt mentioned, modern CPU caches are quite large, and 0.5 MB might well fit in cache. More importantly, though, read-modify-write to uncached memory is VERY slow - the initial read has to block on memory, then the write operation ALSO has to block on memory in order to commit. And just to add insult to injury, the CPU might implement no-cache memory by loading the data into cache, then immediately invalidating the cache line - thus leaving you in a position which is guaranteed to be worse than before.

Before you try outsmarting the CPU like this, you really should benchmark the entire program and see where the real slowdown is. Modern profilers such as Valgrind's cachegrind can measure cache misses, so you can find out whether that is a significant source of slowdown.

On another, more practical note, if you're doing 30 RMWs per second, this is in the worst case something on the order of 1920 bytes of cache footprint (30 accesses, each touching a single 64-byte cache line). This is only 1/16 of the L1 size of a modern Core 2 processor, and likely to be lost in the general noise of the system. So don't worry about it too much :)

That said, if by 'accessed simultaneously' you mean 'accessed by multiple threads simultaneously', be careful about cache lines bouncing between CPUs. This wouldn't be helped by uncached RAM - if anything it'd be worse, as the data would have to travel all the way back to physical RAM each time instead of possibly passing through the faster inter-CPU bus - and the only way to avoid it as a problem is to minimize the frequency of access to shared data. For more about this, see http://www.ddj.com/hpc-high-performance-computing/217500206
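
As an aside (my own illustration, not something the answer spells out): if the threads mostly touch different parts of the data, a common mitigation is to pad each thread's slot out to a full cache line so that independent updates don't invalidate each other's lines (false sharing). For genuinely shared data, the advice above of minimizing access frequency still applies.

#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* typical x86 cache-line size; an assumption, verify for your hardware */

/* Give each thread its own counter on its own cache line so that
   updates by one thread do not bounce the line used by another. */
struct padded_counter {
    alignas(CACHE_LINE) uint64_t count;
    char pad[CACHE_LINE - sizeof(uint64_t)];
};

static struct padded_counter counters[8];   /* one slot per worker thread */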

answered Oct 27 '22 by bdonlan