How to explicitly load a structure into L1d cache?

Question

My goal is to load a static structure into the L1d cache, perform some operations on its members, and then run invd to discard all the modified cache lines. Basically, I want to create a secure environment inside the cache so that no data leaks into RAM while I operate on it.

To do this, I have a kernel module in which I place some fixed values in the members of a structure. Then I disable preemption, disable the cache on all other CPUs (every CPU except the current one), disable interrupts, and use __builtin_prefetch() to load my static structure into the cache. After that, I overwrite the fixed values with new values, execute invd (to discard the modified cache lines), then re-enable the cache on all other CPUs, enable interrupts, and enable preemption. My rationale is that, since I'm doing this in atomic context, INVD should discard all the changes, and after leaving atomic context I should see the original fixed values. That is not what happens, however: after leaving atomic context, I see the values that I used to overwrite the fixed ones. Here is my module code.

It's strange that my output changes after rebooting the PC, and I don't understand why. Now I'm not seeing any changes at all. I'm posting the full code, including some fixes @Peter Cordes suggested:

#include <linux/module.h>    
#include <linux/kernel.h>    
#include <linux/init.h>      
#include <linux/moduleparam.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Author");
MODULE_DESCRIPTION("test INVD");

static struct CACHE_ENV{
    unsigned char in[128];
    unsigned char out[128];
}cacheEnv __attribute__((aligned(64)));

#define cacheEnvSize (sizeof(cacheEnv)/64)
//#define change "Hello"
unsigned char change[]="hello";


void disCache(void *p){
    __asm__ __volatile__ (
        "wbinvd\n"
        "mov %%cr0, %%rax\n\t"
        "or $(1<<30), %%eax\n\t"
        "mov %%rax, %%cr0\n\t"
        "wbinvd\n"
        ::
        :"%rax"
    );

    printk(KERN_INFO "cpuid %d --> cache disable\n", smp_processor_id());

}


void enaCache(void *p){
    __asm__ __volatile__ (
        "mov %%cr0, %%rax\n\t"
        "and $~(1<<30), %%eax\n\t"
        "mov %%rax, %%cr0\n\t"
        ::
        :"%rax"
    );

    printk(KERN_INFO "cpuid %d --> cache enable\n", smp_processor_id());

}

int changeFixedValue (struct CACHE_ENV *env){
    int ret=1;
    //memcpy(env->in, change, sizeof (change));
    //memcpy(env->out, change,sizeof (change));

    strcpy(env->in,change);
    strcpy(env->out,change);
    return ret;
}

void fillCache(unsigned char *p, int num){
    int i;
    //unsigned char *buf = p;
    volatile unsigned char *buf=p;

    for(i=0;i<num;++i){
    
/*
        asm volatile(
        "movq $0,(%0)\n"
        :
        :"r"(buf)
        :
        );
*/
        //__builtin_prefetch(buf,1,1);
        //__builtin_prefetch(buf,0,3);
        *buf += 0;
        buf += 64;   
     }
    printk(KERN_INFO "Inside fillCache, num is %d\n", num);
}

static int __init device_init(void){
    unsigned long flags;
    int result;

    struct CACHE_ENV env;

    //setup Fixed values
    char word[] ="0xabcd";
    memcpy(env.in, word, sizeof(word) );
    memcpy(env.out, word, sizeof (word));
    printk(KERN_INFO "env.in fixed is %s\n", env.in);
    printk(KERN_INFO "env.out fixed is %s\n", env.out);

    printk(KERN_INFO "Current CPU %s\n", smp_processor_id());

    // start atomic
    preempt_disable();
    smp_call_function(disCache,NULL,1);
    local_irq_save(flags);

    asm("lfence; mfence" ::: "memory");
    fillCache(&env, cacheEnvSize);
    
    result=changeFixedValue(&env);

    //asm volatile("invd\n":::);
    asm volatile("invd\n":::"memory");

    // exit atomic
    smp_call_function(enaCache,NULL,1);
    local_irq_restore(flags);
    preempt_enable();

    printk(KERN_INFO "After: env.in is %s\n", env.in);
    printk(KERN_INFO "After: env.out is %s\n", env.out);

    return 0;
}

static void __exit device_cleanup(void){
    printk(KERN_ALERT "Removing invd_driver.\n");
}

module_init(device_init);
module_exit(device_cleanup);

And I'm getting the following output:

[ 3306.345292] env.in fixed is 0xabcd
[ 3306.345321] env.out fixed is 0xabcd
[ 3306.345322] Current CPU (null)
[ 3306.346390] cpuid 1 --> cache disable
[ 3306.346611] cpuid 3 --> cache disable
[ 3306.346844] cpuid 2 --> cache disable
[ 3306.347065] cpuid 0 --> cache disable
[ 3306.347313] cpuid 4 --> cache disable
[ 3306.347522] cpuid 5 --> cache disable
[ 3306.347755] cpuid 6 --> cache disable
[ 3306.351235] Inside fillCache, num is 4
[ 3306.352250] cpuid 3 --> cache enable
[ 3306.352997] cpuid 5 --> cache enable
[ 3306.353197] cpuid 4 --> cache enable
[ 3306.353220] cpuid 6 --> cache enable
[ 3306.353221] cpuid 2 --> cache enable
[ 3306.353221] cpuid 1 --> cache enable
[ 3306.353541] cpuid 0 --> cache enable
[ 3306.353608] After: env.in is hello
[ 3306.353609] After: env.out is hello

My Makefile is

obj-m += invdMod.o
CFLAGS_invdMod.o := -o0
invdMod-objs := disable_cache.o  

all:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
    rm -f *.o

Any thoughts on what I'm doing incorrectly? As I said before, I expect my output to remain unchanged.

One reason I can think of is that __builtin_prefetch() is not putting the structure into the cache. Another way to put something into the cache is to set up a write-back region with the help of MTRR and PAT. However, I'm not sure how to achieve that. I found "12.6. Creating MTRRs from a C programme using ioctl()'s", which shows how to create an MTRR region, but I can't figure out how to bind the address of my structure to that region.

My CPU is : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz

Kernel version : Linux xxx 4.4.0-200-generic #232-Ubuntu SMP Wed Jan 13 10:18:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

GCC version : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

I have compiled this module with the -O0 option.

Update 2: Hyperthreading off

I turned off hyperthreading with echo off > /sys/devices/system/cpu/smt/control. After that, when I run my module, it looks as if changeFixedValue() and fillCache() are not being called.

output:

[ 3971.480133] env.in fixed is 0xabcd
[ 3971.480134] env.out fixed is 0xabcd
[ 3971.480135] Current CPU 3
[ 3971.480739] cpuid 2 --> cache disable
[ 3971.480956] cpuid 1 --> cache disable
[ 3971.481175] cpuid 0 --> cache disable
[ 3971.482771] cpuid 2 --> cache enable
[ 3971.482774] cpuid 0 --> cache enable
[ 3971.483043] cpuid 1 --> cache enable
[ 3971.483065] After: env.in is 0xabcd
[ 3971.483066] After: env.out is 0xabcd

Peter Cordes · Accepted Answer

It looks very unsafe to call printk at the bottom of fillCache. You're about to run a few more stores then an invd, so any modifications printk makes to kernel data structures (like the log buffer) might get written back to DRAM or might get invalidated if they're still dirty in cache. If some but not all stores make it to DRAM (because of limited cache capacity), you could leave kernel data structures in an inconsistent state.

I'd guess that your current tests with HT disabled show everything working even better than you hoped, including discarding stores done by printk, as well as discarding the stores done by changeFixedValue. That would explain the lack of log messages left for user-space to read once your code finishes.

To test this, you'd ideally want to clflush everything printk did, but there's no easy way to do that. Perhaps wbinvd then changeFixedValue then invd. (You're not entering no-fill mode on this core, so fillCache isn't necessary for your store / invd idea to work, see below.)

With Hyperthreading enabled:

CR0.CD is per-physical-core, so having your HT sibling core disable cache also means CD=1 for the isolated core. So with HT enabled, you were in no-fill mode even on the isolated core.

With HT disabled, the isolated core is still normal.

Compile-time and run-time reordering

asm volatile("invd\n":::); without a "memory" clobber tells the compiler it's allowed to reorder it wrt. memory operations. Apparently that isn't the problem in your case, but it's a bug you should fix.

Probably also a good idea to put asm("mfence; lfence" ::: "memory"); right before fillCache, to make sure any cache-miss loads and stores aren't still in flight and maybe allocating new cache lines while your code is running. Or possibly even a fully serializing instruction like asm("xor %eax,%eax; cpuid" ::: "eax", "ebx", "ecx", "edx", "memory");, but I don't know of anything that CPUID blocks which mfence; lfence wouldn't.

The title question: touching memory to bring it into cache

PREFETCHT0 (into L1d cache) is __builtin_prefetch(p,0,3);. A related answer shows how the args map to instructions; you're using prefetchw (write-intent) or, I think, prefetcht1 (L2 cache), depending on compiler options.

But really, since you need this for correctness, you shouldn't be using optional hints that the HW can drop if it's busy. mfence; lfence would make it unlikely for the HW to actually be busy, but using a real load or store instead of a hint is still not a bad idea.

Use a volatile read like READ_ONCE to get GCC to emit a load instruction. Or use volatile char *buf with *buf |= 0; or something to truly RMW instead of prefetch, to make sure the line is exclusively owned without having to get GCC to emit prefetchw.

Perhaps worth running fillCache a couple times, just to make more sure that every line is properly in the state you want. But since your env is smaller than 4k, each line will be in a different set in L1d cache, so there's no risk that one line got tossed out while allocating another (except in case of an alias in L3 cache's hash function? But even then, pseudo-LRU eviction should keep the most-recent line reliably.)
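As a rough sketch of the volatile read-modify-write approach described above, a loop could touch every line that overlaps a buffer, whether or not the buffer is aligned. The names touch_lines_rmw and CACHE_LINE are hypothetical, a 64-byte line size is assumed, and the assert() is only there for a userspace build of the sketch (kernel code would use its own checks).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size */

/* Touch every cache line that overlaps [p, p+len) with a volatile
 * read-modify-write, so the compiler must emit a real load and store
 * rather than an optional prefetch hint. Contents are left unchanged.
 * Returns the number of lines touched. */
static size_t touch_lines_rmw(void *p, size_t len)
{
    assert(p != NULL);
    if (len == 0)
        return 0;

    uintptr_t begin = (uintptr_t)p;
    uintptr_t start = begin & ~(uintptr_t)(CACHE_LINE - 1); /* round down */
    uintptr_t end   = begin + len;          /* one past the last byte */
    size_t lines = 0;

    for (uintptr_t a = start; a < end; a += CACHE_LINE) {
        /* First byte of this line that lies inside the buffer */
        volatile unsigned char *b =
            (volatile unsigned char *)(a < begin ? begin : a);
        *b |= 0;    /* load, then store the same value back */
        ++lines;
    }
    return lines;
}
```

In the module, something like touch_lines_rmw(&env, sizeof(env)) could stand in for the body of fillCache; calling it twice is cheap if you want the extra margin mentioned above.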

Align your data by 128, an aligned-pair of cache lines

static struct CACHE_ENV { ... } cacheEnv; isn't guaranteed to be aligned by the cache line size; you're missing C11 _Alignas(64) or GNU C __attribute__((aligned(64))). So it might be spanning more than sizeof(T)/64 lines. Or for good measure, align by 128 for the L2 adjacent-line prefetcher. (Here you can and should simply align your buffer, but the question "The right way to use function _mm_clflush to flush a large struct" shows how to loop over every cache line of an arbitrary-sized, possibly-unaligned struct.)

This doesn't explain your problem, since the only part that might get missed is the last up-to-48 bytes of env.out. (I think the global struct will get aligned by 16 by default ABI rules.) And you're only printing the first few bytes of each array.
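To make the alignment point concrete, here is a small sketch, assuming 64-byte lines. It declares a struct with the same layout as the question's, aligned to 128 bytes, plus a helper that counts how many lines an object at a given address overlaps. The names cache_env, cache_env_aligned and lines_spanned are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the question's struct */
struct cache_env {
    unsigned char in[128];
    unsigned char out[128];
};

/* Aligned to a 128-byte pair of lines, as suggested above */
static _Alignas(128) struct cache_env cache_env_aligned;

/* Number of 64-byte lines overlapped by len bytes starting at addr */
static size_t lines_spanned(uintptr_t addr, size_t len)
{
    assert(len > 0);
    uintptr_t first = addr / 64;              /* index of first line */
    uintptr_t last  = (addr + len - 1) / 64;  /* index of last line  */
    return (size_t)(last - first + 1);
}
```

With only 16-byte alignment, a 256-byte struct starting at offset 16 within a line overlaps five lines rather than four, so a loop that runs sizeof(T)/64 times from the start of the struct would miss the tail of env.out.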

An easier way: memset(0) to avoid leaking data back to DRAM

And BTW, overwriting your buffer with 0 via memset after you're done should also keep your data from getting written back to DRAM about as reliably as INVD, but faster. (Maybe a manual rep stosb via asm to make sure it can't optimize away as a dead store).
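A minimal sketch of such a wipe, assuming GNU C inline asm on x86-64 and falling back to volatile byte stores elsewhere. The name wipe_buffer is hypothetical. The "memory" clobber and the volatile qualifier are what stop the compiler from treating the stores as dead.

```c
#include <assert.h>
#include <stddef.h>

/* Zero len bytes at p in a way the compiler cannot optimize away. */
static void wipe_buffer(void *p, size_t len)
{
    assert(p != NULL || len == 0);
#if defined(__x86_64__) && defined(__GNUC__)
    void *dst = p;
    size_t cnt = len;
    /* rep stosb stores AL to [RDI] RCX times; the ABI guarantees DF=0 */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(cnt)
                     : "a"(0)
                     : "memory");
#else
    /* Portable fallback: volatile stores must be emitted one by one */
    volatile unsigned char *b = p;
    while (len--)
        *b++ = 0;
#endif
}
```

These are ordinary cacheable stores, so on their own they only replace the dirty data in cache with zeros; whether and when a line is written back to DRAM is still up to the hardware.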

No-fill mode might also be useful here to stop cache misses from evicting existing lines. AFAIK, that basically locks down the cache so no new allocations will happen, and thus no evictions. (But you might not be able to read or write other normal memory, although you could leave a result in registers.)

No-fill mode (for the current core) would make it definitely safe to clear your buffers with memset before re-enabling allocation; no risk of a cache miss during that causing an eviction. Although if your fillCache actually works properly and gets all your lines into MESI Modified state before you do your work, your loads and stores will hit in L1d cache without risk of evicting any of your buffer lines.

If you're worried about DRAM contents (rather than bus signals), then clflushopt each line after memset will reduce the window of vulnerability. (Or memcpy from a clean copy of the original if 0 doesn't work for you, but hopefully you can just work in a private copy and leave the orig unmodified. A stray write-back is always possible with your current method so I wouldn't want to rely on it to definitely always leave a large buffer unmodified.)

Don't use NT stores for a manual memset or memcpy: that might flush the "secret" dirty data before the NT store. One option would be to memset(0) with normal stores or rep stosb, then loop again with NT stores. Or perhaps doing 8x movq normal stores per line, then 8x movnti, so you do both things to the same line back to back before moving on.

Why fillCache at all?

If you're not using no-fill mode, it shouldn't even matter whether the lines are cached before you write to them. You just need your writes to be dirty in cache when invd runs, which should be true even if they got that way from your stores missing in cache.

You already don't have any barrier like mfence between fillCache and changeFixedValue, which is fine but means that any cache misses from priming the cache are still in flight when you dirty it.

INVD itself is serializing, so it should wait for stores to leave the store buffer before discarding cache contents. (So putting mfence;lfence after your work, before INVD, shouldn't make any difference.) In other words, INVD should discard cacheable stores that are still in the store buffer, as well as dirty cache lines, unless committing some of those stores happens to evict anything.

Tags: cpu-architecture, c, x86, linux-kernel, cpu-cache

user45698746
