Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how is tcp(kernel) bypass implemented?

Assuming I would like to avoid the overhead of the linux kernel in handling incoming packets and instead would like to grab the packet directly from user space. I have googled around a bit and it seems that all that needs to happen is one would use raw sockets with some socket options. Is this the case? Or is it more involved than this? And if so, what can I google for or reference in order to implement something like this?

like image 380
Palace Chan Avatar asked Jul 30 '12 20:07

Palace Chan


People also ask

How does kernel bypass work?

Kernel-bypass networking eliminates the overheads of in-kernel network stacks by moving protocol processing to userspace. The packet I/O is either handled by the hardware, the OS, or by userspace, depending on the specific kernel-bypass architecture in use.

Is TCP implemented in kernel?

TCP happens in the kernel. Not in user processes.

What is userspace networking?

It's a multi-vendor and multi-architecture project, and it aims at achieving high I/O performance and reaching high packet processing rates, which are some of the most important features in the networking arena. It was created by Intel in 2010 and moved to the Linux Foundation in April 2017.


1 Answers

There are many techniques for networking with kernel bypass.

First, if you are sending messages to another process on the same machine, you can do so through a shared memory region with no jumps into the kernel.

Passing packets over a network without involving the kernel gets more interesting, and involves specialized hardware that gets direct access to user memory. This idea is called RDMA.

Here's one way it can work (this is what InfiniBand hardware does). The application registers a memory buffer with the RDMA hardware. This buffer is pinned in physical memory, since swapping it out would obviously be bad (since the hardware will keep writing to the physical memory region). A control region is also mapped into userspace memory. When an application is ready to use the buffer to send or receive a message, it writes a command to the control region. The hardware takes the data from a registered buffer on one end, and places it into another registered buffer at the other end.

Clearly, this is too low level, so there are abstractions that make programming RDMA hardware easier. OFED verbs are one such abstraction.

The InfiniBand software stack has one extra interesting bit: the Sockets Direct Protocol (SDP) that is used for compatibility with existing applications. It works by inserting an LD_PRELOAD shim that translates standard socket API calls into IB verbs.

InfiniBand is just what I'm most familiar with. RoCE/iWARP hardware is very similar from the programmer's perspective, but uses a different transport than InfiniBand (TCP using an offload engine in iWarp, Ethernet in RoCE). There are/were also other approaches to RDMA (Quadrics, for example).

like image 149
Greg Inozemtsev Avatar answered Oct 22 '22 22:10

Greg Inozemtsev