Flush cache to DRAM

I'm using a Xilinx Zynq platform with a region of memory shared between the programmable HW and the ARM processor.

I've reserved this memory using memmap on the kernel command line and then exposed it to userspace via mmap/io_remap_pfn_range calls in my driver.

The problem I'm having is that it takes some time for the writes to show up in DRAM, and I presume they're stuck in the dcache. There's a bunch of flush_cache_* calls defined but none of them are exported, which is a clue to me that I'm barking up the wrong tree...

As a trial I locally exported flush_cache_mm just to see what would happen, and no joy.

In short, how can I be sure that any writes to this mmap'd region have been committed to DRAM?

Thanks.

898

asked Sep 19 '13 14:09

Brian Magnuson

3 Answers

ARM processors typically have both an I/D cache and a write buffer. The idea of a write buffer is to gang sequential writes together (great for synchronous DRAM) and to avoid stalling the CPU while a write completes.

To be generic, you can flush the D cache and the write buffer. The following is some inline ARM assembler which should work for many architectures and memory configurations.

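 static inline void dcache_clean(void)
 {
     const int zero = 0;
     /* Test-and-clean the entire D cache -> push dirty lines to external memory.
      * With r15 as the destination the result lands in the condition flags,
      * so loop until the cache reports clean (ARM9-style op, e.g. ARM926EJ-S). */
     __asm volatile ("1: mrc p15, 0, r15, c7, c10, 3\n"
                     " bne 1b\n" ::: "cc");
     /* drain the write buffer (coprocessor must be named p15) */
     __asm volatile ("mcr p15, 0, %0, c7, c10, 4" :: "r" (zero) : "memory");
 }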

You may need more if you have an L2 cache.

To answer in a Linux context, there are different CPU variants and different routines depending on memory/MMU configurations and even CPU errata. See for instance,

proc-arm926.S
cache-v7.S
cache-v6.S
etc

These routines are either called directly or looked up through a CPU info structure holding function pointers to the appropriate routine for the detected CPU and configuration, depending on whether the kernel is built for a single CPU or is multi-purpose, like an Ubuntu distribution kernel.
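The function-pointer lookup described above can be sketched in plain user-space C. This is only an illustration of the dispatch idea, not the kernel's actual structures: `struct cache_fns`, `cache_table`, `select_cache_fns` and the stub routines are all hypothetical names, and the stubs merely record which one ran.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-CPU table of cache maintenance routines. */
struct cache_fns {
    const char *cpu_name;
    void (*flush_range)(void *start, size_t size);
};

static int last_routine;   /* records which stub ran, for illustration only */

/* Stubs standing in for the real, CPU-specific assembler routines. */
static void arm926_flush_range(void *start, size_t size)
{
    (void)start; (void)size;
    last_routine = 926;
}

static void v7_flush_range(void *start, size_t size)
{
    (void)start; (void)size;
    last_routine = 7;
}

static const struct cache_fns cache_table[] = {
    { "arm926", arm926_flush_range },
    { "armv7",  v7_flush_range },
};

/* Return the routines for the detected CPU, or NULL if it is unknown. */
static const struct cache_fns *select_cache_fns(const char *detected)
{
    assert(detected != NULL);
    for (size_t i = 0; i < sizeof cache_table / sizeof cache_table[0]; i++)
        if (strcmp(cache_table[i].cpu_name, detected) == 0)
            return &cache_table[i];
    return NULL;
}
```

A multi-purpose kernel would do the lookup once at boot and call through the pointer afterwards; a kernel built for a single CPU can skip the table and call the routine directly.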

To answer the question specifically for your situation, we need to know about the L2 cache, whether the memory is write-buffered, and the CPU architecture specifics, possibly including silicon revisions for errata. Another tactic is to avoid the problem completely by using the dma_alloc_XXX() routines, which mark memory as un-cacheable and un-bufferable so that CPU writes are pushed externally immediately. Depending on your memory access pattern, either solution is valid. You may wish to keep the memory cached if it only needs to be synchronized at some checkpoint (vsync/hsync for video, etc.).

150

answered Oct 11 '22 06:10

artless noise

I hit the exact same problem on a Zynq. Finally got L2 flushed/invalidated with:

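#include <asm/outercache.h>
outer_cache.flush_range(start, end);
outer_cache.inv_range(start, end);

Here start and end are physical addresses (end = start + size): the outer-cache hooks declared in asm/outercache.h take a physical address range, not a virtual pointer and a length. You also need to flush L1 to L2 first, and that call does take a kernel virtual space pointer and a size:

__cpuc_flush_dcache_area(vaddr, size);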

I'm not sure if invalidating L1 is needed before reading, and I haven't found the function to do this. I assume it would need to be, and I've thus far only been lucky...
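One detail that matters for any range-based flush or invalidate is that the hardware works on whole cache lines, so a buffer that is not line-aligned shares lines with its neighbours. The sketch below only does the address arithmetic in user-space C. The 32-byte line size is an assumption to check against your core's documentation, and `struct line_range`, `cover_lines` and `line_count` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* assumption: 32-byte lines; verify for your core */

struct line_range {
    uintptr_t start;    /* first byte of the first line touched */
    uintptr_t end;      /* exclusive: first byte after the last line touched */
};

/* Expand [addr, addr + size) to the cache lines that fully cover it. */
static struct line_range cover_lines(uintptr_t addr, size_t size)
{
    struct line_range r;
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1);

    assert(size > 0);
    r.start = addr & mask;                           /* round down */
    r.end = (addr + size + CACHE_LINE - 1) & mask;   /* round up */
    return r;
}

/* Number of cache lines a maintenance loop would have to visit. */
static size_t line_count(struct line_range r)
{
    return (size_t)((r.end - r.start) / CACHE_LINE);
}
```

An 8-byte buffer that straddles a line boundary therefore costs two line operations, and invalidating it also discards whatever else lives in those two lines, which is a good reason to keep shared buffers line-aligned and line-sized.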

The suggestions I found on the net all seem to assume the device sits inside the L2 cache coherency domain, so they did not work when the AXI-HP ports were used. With the AXI-ACP port, L2 flushing was not needed. (For those not familiar with Zynq: the HP ports access the DRAM controller directly, bypassing any cache/MMU implemented on the ARM side.)

answered Oct 11 '22 08:10

user2365669

I'm not familiar with Zynq, but you essentially have two options that really work:

either include your other logic on the FPGA in the same coherency domain (if Zynq has an ACP port, for example)
or mark the memory you map as device memory (or other non-cacheable if you don't care about gather, reorder and early write acknowledge) and use a DSB after any write that should be seen.
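The second option can be sketched as a small user-space helper. It assumes the region has already been mapped as device or other non-cacheable memory; on cacheable memory the barrier alone does not push data out of the caches. `write_barrier` and `publish_words` are hypothetical names, and the non-ARM branch is only a stand-in so the sketch builds on other hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Data synchronization barrier: wait for earlier memory accesses to complete. */
static inline void write_barrier(void)
{
#if defined(__aarch64__) || (defined(__ARM_ARCH) && __ARM_ARCH >= 7)
    __asm__ volatile ("dsb sy" ::: "memory");
#else
    __sync_synchronize();   /* stand-in so the sketch builds on non-ARM hosts */
#endif
}

/* Copy words into the shared region, then issue the barrier once at the end. */
static void publish_words(volatile uint32_t *shared, const uint32_t *src, size_t n)
{
    assert(shared != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        shared[i] = src[i];
    write_barrier();
}
```

Issuing one barrier after a batch of writes, rather than after every store, keeps the cost down while still giving a defined point at which the other observer can be told the data is ready.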

If the memory is marked as cacheable and your other observer is not in the same coherency domain, you are asking for trouble: when you clean the D-cache with a DCCISW or similar op and you have an L2 cache, that is where the data will end up, not in DRAM.

answered Oct 11 '22 06:10

Alex Hornung

Tags:

linux-kernel

arm

zynq

xilinx