Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s; roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.

To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.

So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).

To implement zero-copy TCP networking, my first approach was to use SO_ZEROCOPY on the socket:

sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
    perror("send");
    return -1;
}

However, this results in send: Bad address.

After googling for a bit, my second approach was to use a pipe and splice() followed by vmsplice():

ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
    .iov_base = buf,
    .iov_len = len
};

pipe(pipes);

sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
    perror("vmsplice");
    return -1;
}
sent_bytes = splice(pipes[0], 0, fd, 0, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
    perror("splice");
    return -1;
}

However, the result is the same: vmsplice: Bad address.

Note that if I replace the call to vmsplice() or send() to a function that just prints the data pointed to by buf (or a send() without MSG_ZEROCOPY), everything is working just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.

What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?

UPDATE

So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.

like image 211
rem Avatar asked Oct 30 '19 14:10

rem


2 Answers

Maybe this will help you to understand why alloc_pages requires a power-of-2 page number.

To optimize the page allocation process(and decrease external fragmentations), which is frequently engaged, Linux kernel developed per-cpu page cache and buddy-allocator to allocate memory(there is another allocator, slab, to serve memory allocations that are smaller than a page).

Per-cpu page cache serve the one-page allocation request, while buddy-allocator keeps 11 lists, each containing 2^{0-10} physical pages respectively. These lists perform well when allocate and free pages, and of course, the premise is you are requesting a power-of-2-sized buffer.

like image 25
YungeG Avatar answered Sep 18 '22 01:09

YungeG


As I posted in an update in my question, the underlying problem is that zerocopy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of struct page* associated with each page, which it needs.

The solution then is to allocate the memory in a way that struct page*s are associated with the memory.

The workflow that now works for me to allocate the memory is:

  1. Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
  2. Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries.

Now to submit such a region to the DMA (for data reception):

  1. dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
  2. dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
  3. dmaengine_submit(dma_desc);

When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:

  1. dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);

Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:

static int my_mmap(struct file *file, struct vm_area_struct *vma) {
    int res;
...
    for (i = 0; i < 2**page_order; ++i) {
        if ((res = vm_insert_page(vma, vma->vm_start + i*PAGE_SIZE, &page[i])) < 0) {
            break;
        }
    }
    vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
...
    return res;
}

When the file is closed, don't forget to free the pages:

for (i = 0; i < 2**page_order; ++i) {
    __free_page(&dev->shm[i].pages[i]);
}

Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.

Although this works, there are two things that don't sit well with me with this approach:

  • You can only allocate power-of-2-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed with decreasing orders to get any size buffer made up of sub-buffers of varying sizes. This will then require some logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than single.
  • split_page() says in its documentation:
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.

These issues would be easily solved if there was some interface in the kernel to allocate an arbitrary amount of contiguous physical pages. I don't know why there isn't, but I don't find the above issues so important as to go digging into why this isn't available / how to implement it :-)

like image 53
rem Avatar answered Sep 20 '22 01:09

rem