ARM/neon memcpy optimized for *uncached* memory?

Question

I'm using a Xilinx Zynq 7000 ARM-based SoC. I'm struggling with DMA buffers (see my earlier question, "Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)"), so one thing I pursued was a faster memcpy.

I've been looking at writing a faster memcpy for ARM using Neon instructions and inline asm. Whatever glibc has, it's terrible, especially if we're copying from an uncached DMA buffer.

I've put together my own copy function from various sources, including:

Fast ARM NEON memcpy
arm Inline assembly in gcc
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

The main difference for me is that I'm trying to copy from an uncached buffer because it's a DMA buffer, and ARM support for cached DMA buffers is nonexistent.

So here's what I wrote:

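void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    /* Round sz up to the next multiple of 64; both buffers must be padded to match. */
    if (sz & 63) {
        sz = (sz & -64) + 64;
    }
    asm volatile (
        "NEONCopyPLD:                  \n"
        "    VLDM %[src]!,{d0-d7}      \n"  /* load 64 bytes, advance src */
        "    VSTM %[dst]!,{d0-d7}      \n"  /* store 64 bytes, advance dst */
        "    SUBS %[sz],%[sz],#0x40    \n"  /* 64 fewer bytes to go */
        "    BGT NEONCopyPLD           \n"
        : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
}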
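One caveat with the size rounding in my_copy: when sz is not a multiple of 64, the loop reads and writes past the requested length, so both buffers need that much padding. A minimal portable sketch of the arithmetic, plus one way to avoid the overrun by copying the tail separately, might look like this. The names round_up_block and copy_with_tail are hypothetical, and memcpy stands in for the NEON loop so the sketch builds on any target:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COPY_BLOCK 64u  /* bytes moved per VLDM/VSTM {d0-d7} pair */

/* Same rounding as my_copy: smallest multiple of 64 that is >= sz. */
static size_t round_up_block(size_t sz)
{
    return (sz + COPY_BLOCK - 1) & ~(size_t)(COPY_BLOCK - 1);
}

/* Hypothetical wrapper: copy whole 64-byte blocks in bulk, then finish
 * the remaining 0..63 bytes one at a time so nothing past sz is touched. */
static void copy_with_tail(unsigned char *dst, const unsigned char *src, size_t sz)
{
    size_t bulk = sz & ~(size_t)(COPY_BLOCK - 1);  /* round down to 64 */

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, bulk);            /* stand-in for the NEON loop */
    for (size_t i = bulk; i < sz; i++) /* byte-wise tail */
        dst[i] = src[i];
}
```

For example, a 100-byte request rounds up to 128 bytes in my_copy, whereas copy_with_tail moves one 64-byte block and then 36 single bytes.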

The main thing I did was leave out the prefetch instruction because I figured it would be worthless on uncached memory.

Doing this resulted in a speedup of 4.7x over the glibc memcpy. Speed went from about 70MB/sec to about 330MB/sec.

Unfortunately, this isn't nearly as fast as memcpy from cached memory, which runs at around 720MB/sec for the system memcpy and 620MB/sec for the Neon version (probably slower because my memcpy doesn't do prefetching).
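For anyone who wants to reproduce numbers like these, here is a rough sketch of a throughput measurement. It is not the harness used for the figures above; copy_fn, libc_copy and copy_mb_per_sec are hypothetical names, and it assumes a POSIX system with clock_gettime and a GCC-compatible compiler (for the empty asm barrier):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical common signature so different copy routines can be compared. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Adapter for the C library memcpy. */
static void libc_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Run fn reps times over n bytes and return throughput in MB/sec
 * (1 MB = 10^6 bytes). Returns 0.0 if the elapsed time was not measurable. */
static double copy_mb_per_sec(copy_fn fn, void *dst, const void *src,
                              size_t n, int reps)
{
    struct timespec t0, t1;

    assert(fn != NULL && n > 0 && reps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        fn(dst, src, n);
        /* Compiler barrier so repeated copies are not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs > 0.0 ? ((double)n * reps) / (secs * 1e6) : 0.0;
}
```

Calling copy_mb_per_sec(libc_copy, dst, src, size, 50) once with a cached source and once with the uncached DMA mapping would give the two figures to compare; the buffer should be much larger than the caches so the cached case is not flattered.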

Can anyone help me figure out what I can do to make up for this performance gap?

I've tried a number of things like copying more at once, two loads followed by two stores. I could try prefetch just to prove that it's useless. Any other ideas?

rsaxvc · Accepted Answer

If you're trying to do large, fast transfers, cached memory will often outperform uncached memory, but as you pointed out, support for cached DMA buffer memory must be managed somewhere, and on ARMv7 and earlier, that place is the kernel or a kernel driver.

I'm assuming two things about your design:

Userspace is reading a memory-mapped hardware buffer
There's some sort of signal/event/interrupt from the FPGA to the Cortex-A9 VIC/GIC that tells the Cortex-A9 when a new buffer is available to read.

Align your DMA buffers on cacheline boundaries and do not place anything between the end of the DMA buffer and the next cacheline. Invalidate the cache whenever the FPGA signals the CPU that a buffer is ready.
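As a rough illustration of the layout rule only: the sketch below assumes the 32-byte cache line of the Cortex-A9, pads a buffer size up to a whole number of lines, and checks a pointer's alignment. The helper names are hypothetical, and aligned_alloc is just a userspace stand-in; a real DMA buffer would be reserved and mapped by the kernel driver, which would also be responsible for the invalidate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 32u  /* assumption: Cortex-A9 line size, check your part */

/* Round n up so the buffer ends exactly on a cache-line boundary,
 * leaving no room for unrelated data to share its last line. */
static size_t pad_to_cache_line(size_t n)
{
    return (n + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
}

/* Nonzero if p starts on a cache-line boundary. */
static int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t)p & (CACHE_LINE - 1)) == 0;
}

/* Hypothetical helper: line-aligned, line-padded allocation (C11). */
static void *alloc_line_aligned(size_t n)
{
    size_t padded = pad_to_cache_line(n);

    if (n == 0)
        return NULL;
    assert(padded >= n);  /* guards against overflow in the rounding */
    return aligned_alloc(CACHE_LINE, padded);
}
```

With this rule a 100-byte buffer occupies 128 bytes (four full lines), so invalidating those lines cannot discard a neighbouring variable's dirty data.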

I don't think the A9 has a mechanism to control cachelines on all cores and layers together, so you may wish to pin the program doing this to one core so that you can skip maintaining caches on the other core.
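Pinning could be sketched as follows, assuming Linux with glibc (sched_setaffinity and the CPU_* macros need _GNU_SOURCE). The helper names pin_to_cpu and first_allowed_cpu are hypothetical, and which core to choose is left to the caller:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>

/* Restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (errno set by the kernel). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);  /* pid 0 = caller */
}

/* Lowest-numbered CPU the caller is currently allowed to run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Calling pin_to_cpu(first_allowed_cpu()) early in the program keeps the reader on one core, so any cache maintenance only has to account for that core's L1 plus the shared L2.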

Tags:

arm

memcpy

neon

soc

Timothy Miller
