
Should we use multiple acceptor sockets to accept a large number of connections?

As is known, SO_REUSEPORT allows multiple sockets to listen on the same IP address and port combination. It increases requests per second by 2 to 3 times, and reduces both latency (by ~30%) and the standard deviation of latency (by 8 times): https://www.nginx.com/blog/socket-sharding-nginx-release-1-9-1/

NGINX release 1.9.1 introduces a new feature that enables use of the SO_REUSEPORT socket option, which is available in newer versions of many operating systems, including DragonFly BSD and Linux (kernel version 3.9 and later). This socket option allows multiple sockets to listen on the same IP address and port combination. The kernel then load balances incoming connections across the sockets. ...

As shown in the figure, reuseport increases requests per second by 2 to 3 times, and reduces both latency and the standard deviation for latency.

[Benchmark figures from the NGINX blog post: requests per second, latency, and latency standard deviation, with and without reuseport.]

SO_REUSEPORT is available on most modern OSes: Linux (kernel >= 3.9, since 29 Apr 2013), FreeBSD/OpenBSD/NetBSD, macOS, iOS/watchOS/tvOS, IBM AIX 7.2, Oracle Solaris 11.1, Windows (which has only SO_REUSEADDR, behaving like the two BSD flags SO_REUSEPORT+SO_REUSEADDR combined), and possibly Android: https://stackoverflow.com/a/14388707/1558037

Linux >= 3.9

  1. Additionally the kernel performs some "special magic" for SO_REUSEPORT sockets that isn't found in other operating systems: For UDP sockets, it tries to distribute datagrams evenly, for TCP listening sockets, it tries to distribute incoming connect requests (those accepted by calling accept()) evenly across all the sockets that share the same address and port combination. Thus an application can easily open the same port in multiple child processes and then use SO_REUSEPORT to get a very inexpensive load balancing.
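
A rough sketch of that pattern with threads instead of child processes: each thread binds its own SO_REUSEPORT listener to the same port and runs its own accept() loop, so no accept lock is shared between threads (NUM_ACCEPTORS and the port are placeholders, and error handling is omitted for brevity):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NUM_ACCEPTORS 4      /* placeholder: e.g. one per core you want to use */
    #define LISTEN_PORT   8080   /* placeholder port                               */

    /* Each thread owns a private listener on the same ip:port; the kernel
     * distributes incoming connections across all of them.                 */
    static void *acceptor_thread(void *arg)
    {
        (void)arg;

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(LISTEN_PORT);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, SOMAXCONN);

        for (;;) {
            int client = accept(fd, NULL, NULL);  /* no descriptor shared with other threads */
            if (client < 0)
                continue;
            /* ... recv()/send() on `client` in this same thread ... */
            close(client);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NUM_ACCEPTORS];
        for (int i = 0; i < NUM_ACCEPTORS; i++)
            pthread_create(&tids[i], NULL, acceptor_thread, NULL);
        for (int i = 0; i < NUM_ACCEPTORS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }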

It is also known that, to avoid spin-lock contention and achieve high performance, no socket should be read by more than one thread; i.e. each thread should process its own sockets for read/write.

  • accept() is a thread-safe function for the same socket descriptor, so it has to be guarded by a lock - and that lock contention reduces performance: http://unix.derkeiler.com/Newsgroups/comp.unix.programmer/2007-06/msg00246.html

POSIX.1-2001/SUSv3 requires accept(), bind(), connect(), listen(), socket(), send(), recv(), etc. to be thread-safe functions. It's possible that there are some ambiguities in the standard regarding their interaction with threads, but the intention is that their behaviour in multithreaded programs is governed by the standard.

  • If we use the same single socket from many threads, then performance will be low, because the socket is protected by a lock for thread-safe access from many threads: https://blog.cloudflare.com/how-to-receive-a-million-packets/

The receiving performance is down compared to a single threaded program. That's caused by a lock contention on the UDP receive buffer side. Since both threads are using the same socket descriptor, they spend a disproportionate amount of time fighting for a lock around the UDP receive buffer. This paper describes the problem in more detail.

  • More details about the spin-lock taken when the application tries to read data from the socket - "Analysis of Linux UDP Sockets Concurrent Performance": http://www.jcc2014.ucm.cl/jornadas/WORKSHOP/WSDP%202014/WSDP-4.pdf

V. KERNEL ISOLATION

....

From the other side, when the application tries to read data from the socket, it executes a similar process, which is described below and represented in Figure 3 from right to left:

1) Dequeue one or more packets from the receive queue, using the corresponding spinlock (green one).

2) Copy the information to user-space memory.

3) Release the memory used by the packet. This potentially changes the state of the socket, so two ways of locking the socket can occur: fast and slow. In both cases, the packet is unlinked from the socket, Memory Accounting statistics are updated and the socket is released according to the locking path taken.

I.e. when many threads are accessing the same socket, performance degrades due to waiting on one spin-lock.
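
For contrast, the contended pattern those references describe looks roughly like this: several threads all reading from the same socket descriptor and therefore serializing on its receive-queue lock (an illustrative UDP anti-pattern rather than a recommended design; the port is arbitrary and error handling is omitted):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <sys/socket.h>

    #define NUM_READERS 4

    /* Anti-pattern: every thread reads the SAME descriptor, so the kernel's
     * per-socket receive-queue spin-lock is what they end up fighting over. */
    static void *reader_thread(void *arg)
    {
        int fd = *(int *)arg;
        char buf[2048];
        for (;;)
            recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);  /* all threads contend here */
        return NULL;
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = { 0 };
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(4321);              /* arbitrary example port */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        pthread_t tids[NUM_READERS];
        for (int i = 0; i < NUM_READERS; i++)
            pthread_create(&tids[i], NULL, reader_thread, &fd);
        for (int i = 0; i < NUM_READERS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }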


We have a server with two Xeon CPUs of 32 HT cores each (64 HT cores in total), two 10 Gbit Ethernet cards, and Linux (kernel 3.9).

We use RFS and XPS - i.e. for a given connection, the TCP/IP stack is processed (in kernel space) on the same CPU core as the application thread (in user space).

There are at least 3 ways to accept connections and process them in many threads:

  • Use one acceptor socket shared between many threads, where each thread accepts connections and processes them
  • Use one acceptor socket in one thread, and have this thread push the received socket descriptors of connections to worker threads via a thread-safe queue
  • Use many acceptor sockets that listen on the same ip:port, one individual acceptor socket per thread, where the thread that receives a connection then processes it (recv/send)

Which is the most efficient way if we accept a lot of new TCP connections?

Asked Jul 09 '17 by Alex




2 Answers

Having had to handle such a situation in production, here's a good way to approach this problem:

First, set up a single thread to handle all incoming connections. Modify the affinity map so that this thread has a dedicated core that no other threads in your application (or even your entire system) will try to access. You can also modify your boot scripts so that certain cores are never automatically assigned to an execution unit unless that specific core is explicitly requested (i.e. the isolcpus kernel boot parameter).

Mark that core as unused, and then explicitly request it in your code for the "listen to socket" thread via cpuset.
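
One in-process way to do that pinning is sketched below (cpusets/cgroups achieve the same thing from outside the process; the core number is only an example of a core you reserved with isolcpus):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define LISTENER_CORE 3   /* example: a core reserved with isolcpus=3 at boot */

    /* Pin the calling thread - here, the "listen to socket" thread - to the
     * reserved core so nothing else gets scheduled alongside it.             */
    int pin_current_thread(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

Calling pin_current_thread(LISTENER_CORE) at the top of the listener thread's start routine keeps that thread on the reserved core.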

Next, set up a queue (ideally a priority queue) that prioritizes write operations (i.e. "the second readers-writers problem"). Now set up however many worker threads as you see reasonable.

At this point, the goal of the "incoming connections" thread should be to (a minimal sketch follows this list):

  • accept() incoming connections.
  • Pass these connection file descriptors (FDs) off to your writer-prioritized queue structure as quickly as possible.
  • Go back to its accept() state as quickly as possible.
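
A minimal sketch of that accept-and-hand-off loop, assuming a hypothetical thread-safe fd_queue type with a push operation (it is not from any particular library; substitute whatever writer-prioritized queue you build):

    #include <sys/socket.h>

    /* Hypothetical queue API - the writer (this thread) must never block
     * behind the reader side.                                             */
    struct fd_queue;
    void fd_queue_push(struct fd_queue *q, int fd);

    /* The dedicated listener thread: accept, hand the fd off, loop again.
     * `listen_fd` is the already-bound, already-listening acceptor socket. */
    void listener_loop(int listen_fd, struct fd_queue *q)
    {
        for (;;) {
            int client = accept(listen_fd, NULL, NULL);
            if (client < 0)
                continue;              /* e.g. EINTR: just keep accepting  */
            fd_queue_push(q, client);  /* as cheap as possible, then return
                                          straight to accept()             */
        }
    }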

This will allow you to delegate incoming connections as quickly as possible. Your worker threads can grab items from the shared queue as they arrive. It might also be worth having a second, high-priority thread that grabs data from this queue, and moves it to a secondary queue, saving the "listen to socket" thread from having to spend extra cycles delegating client FDs.

This would also prevent the "listen to socket" thread and the worker threads from ever having to access the same queue concurrently, which would save you from worst-case scenarios like a slow worker thread locking the queue when the "listen to socket" thread wants to drop data in it. i.e.

Incoming client connections

 ||
 || Listener thread - accept() connection.
 \/

Listener/Helper queue

 ||
 || Helper thread
 \/

Shared Worker queue

 ||
 || Worker thread #n
 \/

Worker-specific memory space. read() from client.
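
And a matching sketch of one worker's side of this pipeline, again assuming the same hypothetical fd_queue type, this time with a blocking pop operation:

    #include <sys/socket.h>
    #include <unistd.h>

    /* Hypothetical counterpart to the queue used in the listener sketch above. */
    struct fd_queue;
    int fd_queue_pop(struct fd_queue *q);   /* blocks until a client fd arrives */

    /* One worker thread: take a client fd from the shared queue and serve it
     * entirely in this thread, so no other thread ever touches this socket.  */
    void worker_loop(struct fd_queue *q)
    {
        char buf[4096];
        for (;;) {
            int client = fd_queue_pop(q);
            ssize_t n;
            while ((n = recv(client, buf, sizeof(buf), 0)) > 0) {
                /* ... parse the request and send() the reply from here ... */
            }
            close(client);   /* n == 0: peer closed; n < 0: error */
        }
    }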

As for your other two proposed options:

Use one acceptor socket shared between many threads, where each thread accepts connections and processes them.

Messy. The threads will have to somehow take turns issuing the accept() call, and there won't be any benefit to doing this. You'll also have some additional sequencing logic to handle which thread's "turn" is up.

Use many acceptor sockets that listen on the same ip:port, one individual acceptor socket per thread, where the thread that receives a connection then processes it (recv/send).

Not the most portable option. I'd avoid it. Also, you'll potentially need to make your server multi-process (i.e. fork()) rather than multi-threaded, depending on OS, kernel version, etc.

Answered by Cloud


Assuming you have two 10 Gbps network connections and a 500-byte average frame size (which is very conservative for a server without interactive use), you'll have around 2 million packets per second per network card (I don't believe you have more than this), which means processing about 4 packets per microsecond across both cards. That is a very comfortable rate for a machine like the one described in your configuration. On these premises, I'd say that your bottleneck will be in the network (and the switches you connect to) rather than in the spinlock on each socket (it takes only a few CPU cycles to resolve a spinlock, far below the limit imposed by the network). Anyway, I'd dedicate at most a thread or two per network card (one for reading and the other for writing) and not think much more about the socket locking features. Most probably your bottleneck is in the application software in the backend of this configuration.
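
A quick back-of-the-envelope check of that rate under the same assumptions (two cards, 500-byte average frames); the answer rounds 2.5 Mpps per card down to roughly 2 Mpps, which is where its figure of ~4 packets per microsecond comes from:

    #include <stdio.h>

    int main(void)
    {
        const double link_bps     = 10e9;          /* one 10 Gbit/s card             */
        const double frame_bits   = 500.0 * 8.0;   /* assumed 500-byte average frame */
        const double pps_per_card = link_bps / frame_bits;   /* = 2.5e6 frames/s     */
        const double pps_total    = 2.0 * pps_per_card;      /* two cards            */

        printf("%.2f Mpps per card, %.2f packets per microsecond in total\n",
               pps_per_card / 1e6, pps_total / 1e6);          /* 2.50 and 5.00       */
        return 0;
    }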

Even if you do run into trouble, it would probably be better to make some modifications to the kernel software than to keep adding more and more processors or to think about distributing the spinlocks across different sockets. Or, even better, add more network cards to alleviate the bottleneck.

Answered by Luis Colorado