High-performance TCP Socket programming in .NET C#

I know this topic has been asked before, and I have read almost all the threads and comments, but I still haven't found an answer to my problem.

I'm working on a high-performance network library that must include a TCP server and client, must be able to accept 30000+ connections, and must have the highest possible throughput.

I know very well that I have to use async methods, and I have already implemented and tested every kind of solution I could find.

In my benchmarks I used only minimal code to avoid any overhead, and I used profiling to minimize the CPU load; there is no more room for simple optimization. On the receiving socket the buffered data was always read, counted and discarded so that the socket buffer never filled up completely.

The test case is very simple: one TCP socket listens on localhost, another TCP socket connects to it (from the same program, on the same machine of course), then an infinite loop sends 256kB packets from the client socket to the server socket.

A timer with a 1000ms interval prints the byte counters of both sockets to the console to make the bandwidth visible, then resets them for the next measurement.
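A minimal sketch of such a per-second byte counter is shown here. The class name ThroughputCounter and its members are hypothetical (the original benchmark code was not posted); the sketch only assumes that the receive and send callbacks can call Add from any thread.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not taken from the original benchmark code.
public sealed class ThroughputCounter
{
    private long _bytes;

    // Called from the socket callbacks; Interlocked keeps the counter thread-safe.
    public void Add(int count) => Interlocked.Add(ref _bytes, count);

    // Returns the bytes counted since the last call and resets the counter atomically.
    public long SnapshotAndReset() => Interlocked.Exchange(ref _bytes, 0);

    // Prints the measured rate every intervalMs milliseconds.
    // The caller must keep a reference to the returned Timer, otherwise it can be garbage collected.
    public Timer StartReporting(string label, int intervalMs = 1000) =>
        new Timer(_ =>
        {
            double mbPerSec = SnapshotAndReset() / (1024.0 * 1024.0) * 1000.0 / intervalMs;
            Console.WriteLine($"{label}: {mbPerSec:F1} MB/s");
        }, null, intervalMs, intervalMs);
}
```

Resetting with Interlocked.Exchange avoids losing bytes that are counted between reading and zeroing the counter.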

I found that the sweet spot for maximum throughput is a 256kB packet size with a 64kB socket buffer.

With the async/await type methods I could reach:

~370MB/s (~3.2Gbps) on Windows, ~680MB/s (~5.8Gbps) on Linux with mono

With the BeginReceive/EndReceive/BeginSend/EndSend type methods I could reach:

~580MB/s (~5.0Gbps) on Windows, ~9GB/s (~77.3Gbps) on Linux with mono

With the SocketAsyncEventArgs/ReceiveAsync/SendAsync type methods I could reach:

~1.4GB/s (~12Gbps) on Windows, ~1.1GB/s (~9.4Gbps) on Linux with mono

The problems are the following:

The async/await methods were the slowest, so I will not work with them.
The BeginReceive/EndReceive methods, together with the BeginAccept/EndAccept methods, started new async threads. Under Linux/mono every new socket instance was extremely slow: when there were no more threads in the ThreadPool, mono started up new ones, but creating 25 connections took about 5 minutes and creating 50 was impossible (the program just stopped doing anything after ~30 connections).
Changing the ThreadPool size did not help at all, and I would not change it anyway (it was just a debugging step).
The best solution so far is SocketAsyncEventArgs. It gives the highest throughput on Windows, but on Linux/mono it is slower than on Windows, whereas with the other methods it was the opposite.

I've benchmarked both my Windows and my Linux machine with iperf:

The Windows machine produced ~1GB/s (~8.58Gbps), the Linux machine produced ~8.5GB/s (~73.0Gbps)

The weird thing is that iperf gave a weaker result than my application on Windows, but a much higher one on Linux.

First of all, I would like to know whether these results are normal, or whether I can get better results with a different solution.

If I decide to use the BeginReceive/EndReceive methods (they produced the highest result on Linux/mono), how can I fix the threading problem so that creating connection instances is fast, and eliminate the stalled state after creating multiple instances?

I will continue benchmarking and will share the results if there is anything new.

================================= UPDATE ==================================

I promised code snippets, but after many hours of experimenting the overall code is kind of a mess, so I will just share my experience in case it can help someone.

I came to realize that under Windows 7 the loopback device is slow: I could not get a result higher than 1GB/s with iperf or NTttcp. Only Windows 8 and newer versions have a fast loopback, so I don't care about Windows results anymore until I can test on a newer version. SIO_LOOPBACK_FAST_PATH should be enabled via Socket.IOControl, but it throws an exception on Windows 7.
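A hedged sketch of how the fast path can be requested without crashing on systems that do not support it follows. The class and method names are hypothetical; the control code 0x98000010 is the SIO_LOOPBACK_FAST_PATH value from the Winsock headers. The option has to be set before connecting or listening, and on both ends of the connection.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper, not part of the original library.
public static class LoopbackFastPath
{
    // SIO_LOOPBACK_FAST_PATH as defined in the Winsock headers.
    private const int SioLoopbackFastPath = unchecked((int)0x98000010);

    // Tries to enable the TCP loopback fast path.
    // Returns false instead of throwing where the option is unsupported
    // (for example Windows 7 or non-Windows platforms).
    public static bool TryEnable(Socket socket)
    {
        try
        {
            socket.IOControl(SioLoopbackFastPath, BitConverter.GetBytes(1), null);
            return true;
        }
        catch (SocketException) { return false; }
        catch (NotSupportedException) { return false; }
    }
}
```

Typical use would be to call LoopbackFastPath.TryEnable(socket) right after creating each socket and simply ignore a false result.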

It turned out that the most powerful solution is the Completed-event-based SocketAsyncEventArgs implementation, both on Windows and on Linux/Mono. Creating a few thousand client instances never messed up the ThreadPool, and the program did not stop suddenly as I mentioned above. This implementation is very gentle on threading.

Creating 10 connections to the listening socket and feeding data to the clients from 10 separate ThreadPool threads produced ~2GB/s of traffic on Windows and ~6GB/s on Linux/Mono.

Increasing the client connection count did not improve the overall throughput; the total traffic simply became distributed among the connections. This might be because the CPU load was 100% on all cores/threads, whether with 5, 10 or 200 clients.

I think the overall performance is not bad: 100 clients could produce around 500Mbit/s of traffic each. (Of course this was measured on local connections; a real-life network scenario would be different.)

The only observation I would share: experimenting with both the socket in/out buffer sizes and the program read/write buffer sizes/loop cycles affected the performance heavily, and very differently on Windows and on Linux/Mono.

On Windows the best performance was reached with 128kB socket-receive, 32kB socket-send, 16kB program-read and 64kB program-write buffers.

On Linux the previous settings produced very weak performance; 512kB for both socket-receive and socket-send, a 256kB program-read and a 128kB program-write buffer worked best.
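As a sketch, the socket-level sizes reported above could be applied like this. The class name SocketTuning is hypothetical, and the values are only the ones measured on these particular test machines, so treat them as starting points for benchmarking. Non-Linux platforms are assumed to take the Windows values here.

```csharp
using System.Net.Sockets;
using System.Runtime.InteropServices;

// Hypothetical helper; the sizes are the ones reported in the post.
public static class SocketTuning
{
    public static void ApplyBufferSizes(Socket socket)
    {
        bool isLinux = RuntimeInformation.IsOSPlatform(OSPlatform.Linux);

        // Socket (kernel) buffers: 512kB/512kB on Linux, 128kB/32kB otherwise.
        socket.ReceiveBufferSize = isLinux ? 512 * 1024 : 128 * 1024;
        socket.SendBufferSize = isLinux ? 512 * 1024 : 32 * 1024;
    }
}
```

The operating system may adjust or cap the requested sizes (Linux, for example, reports back a doubled value and limits it by its rmem_max/wmem_max settings), so reading the properties back after setting them shows what is actually in effect.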

Now my only problem is that if I try to create 10000 connecting sockets, it just stops creating instances after around 7005. It does not throw any exception and the program keeps running as if there were no problem. I don't know how it can exit a for loop without a break, but it does.

Any help would be appreciated regarding anything I was talking about!


asked Sep 05 '18 03:09

beatcoder

2 Answers

Because this question gets a lot of views I decided to post an "answer". Technically this isn't an answer but my final conclusion for now, so I will mark it as the answer.

About the approaches:

The async/await functions produce awaitable async Tasks assigned to the TaskScheduler of the dotnet runtime, so with thousands of simultaneous connections, and therefore thousands of reading/writing operations, thousands of Tasks are started. As far as I know this creates thousands of state machines stored in RAM and countless context switches on the threads they are assigned to, resulting in very high CPU overhead. With a few connections/async calls it is better balanced, but as the number of awaitable Tasks grows it slows down sharply.

The BeginReceive/EndReceive/BeginSend/EndSend socket methods are technically async methods with no awaitable Tasks, but with callbacks at the end of the call, which makes better use of multithreading. Still, in my opinion the dotnet design of these socket methods is poor and limiting, but for simple solutions (or a limited number of connections) it is the way to go.

The SocketAsyncEventArgs/ReceiveAsync/SendAsync type of socket implementation is the best on Windows for a reason. It utilizes Windows IOCP in the background to achieve the fastest async socket calls, using Overlapped I/O and a special socket mode. This solution is the "simplest" and fastest under Windows. But under mono/Linux it will never be that fast, because mono emulates Windows IOCP using Linux epoll. epoll itself is actually much faster than IOCP, but having to emulate IOCP for dotnet compatibility causes some overhead.
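For readers who have not used this pattern, here is a minimal receive-loop sketch built on SocketAsyncEventArgs. It is an illustration under stated assumptions, not the code that was benchmarked: the class name SaeaReceiver and the onBytes callback are hypothetical, and the received data is only counted and discarded.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical receiver: reads from a connected socket with one reusable buffer.
public sealed class SaeaReceiver
{
    private readonly Socket _socket;
    private readonly SocketAsyncEventArgs _args = new SocketAsyncEventArgs();
    private readonly Action<int> _onBytes;

    public SaeaReceiver(Socket socket, int bufferSize, Action<int> onBytes)
    {
        _socket = socket;
        _onBytes = onBytes;
        _args.SetBuffer(new byte[bufferSize], 0, bufferSize); // allocated once, reused for every read
        // Raised only when a receive completes asynchronously.
        _args.Completed += (_, e) => { if (Process(e)) Start(); };
    }

    // Posts receives until one is pending.
    public void Start()
    {
        // ReceiveAsync returns false when it completed synchronously;
        // in that case Completed is NOT raised, so the result is handled here.
        while (!_socket.ReceiveAsync(_args))
        {
            if (!Process(_args)) return;
        }
    }

    // Returns false when the connection is finished (error or remote close).
    private bool Process(SocketAsyncEventArgs e)
    {
        if (e.SocketError != SocketError.Success || e.BytesTransferred == 0)
        {
            _socket.Close();
            return false;
        }
        _onBytes(e.BytesTransferred);
        return true;
    }
}
```

Handling synchronous completions in a loop rather than by calling the handler recursively keeps the stack flat when data is always available, which is the usual situation in a loopback benchmark.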

About buffer sizes:

There are countless ways to handle data on sockets. Reading is straightforward: data arrives, You know its length, You just copy the bytes from the socket buffer to Your application and process them. Sending data is a bit different.

You can pass Your complete data to the socket and it will cut it into chunks and copy the chunks to the socket buffer until there is nothing more to send; the send method of the socket returns when all data is sent (or when an error happens).
You can take Your data, cut it into chunks Yourself and call the socket send method with one chunk, and when it returns send the next chunk, until there is nothing more to send.
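The second option from the list above can be sketched as follows for a blocking socket. The class and method names are hypothetical, and the returned call count is only there to make the cost of small chunks visible.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper illustrating application-side chunking.
public static class ChunkedSender
{
    // Sends data in pieces of at most chunkSize bytes and returns the number of Send calls made.
    public static int SendInChunks(Socket socket, byte[] data, int chunkSize)
    {
        int calls = 0;
        int offset = 0;
        while (offset < data.Length)
        {
            int count = Math.Min(chunkSize, data.Length - offset);
            // Send may accept fewer bytes than requested, so advance by its return value.
            offset += socket.Send(data, offset, count, SocketFlags.None);
            calls++;
        }
        return calls;
    }
}
```

The first option corresponds to a single socket.Send(data) call on a blocking socket, where the splitting happens inside the socket implementation instead.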

In any case You should consider what socket buffer size to choose. If You are sending large amounts of data, then the bigger the buffer, the fewer chunks have to be sent, and therefore the fewer calls Your (or the socket's internal) loop has to make: less memory copying, less overhead. But allocating large socket buffers and program data buffers results in large memory usage, especially if You have thousands of connections, and repeatedly allocating (and freeing) large blocks of memory is always expensive.
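To put rough numbers on this trade-off, the following sketch computes the number of calls for a payload and the worst-case buffer memory for many connections. The names are hypothetical and the memory figure is a simple upper bound that assumes every connection fills both of its buffers.

```csharp
// Hypothetical helper for back-of-the-envelope buffer sizing.
public static class BufferMath
{
    // Calls needed to move payloadBytes through a buffer of bufferBytes (rounded up).
    public static long CallCount(long payloadBytes, int bufferBytes) =>
        (payloadBytes + bufferBytes - 1) / bufferBytes;

    // Upper bound on buffer memory if every connection fills both of its buffers.
    public static long WorstCaseBufferBytes(int connections, int recvBytes, int sendBytes) =>
        (long)connections * (recvBytes + sendBytes);
}
```

For example, a 256kB payload takes 4 calls with a 64kB buffer but 64 calls with a 4kB buffer, while 10000 connections with 512kB receive and 512kB send buffers could tie up as much as 10,485,760,000 bytes (about 9.8 GiB).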

On the sending side a 1-2-4-8kB socket buffer size is ideal for most cases, but if You plan to send large files (over a few MB) regularly, then a 16-32-64kB buffer size is the way to go. There is usually no point in going over 64kB.

But this is an advantage only if the receiving side has relatively large receive buffers too.

For connections over the internet (not a local network) there is usually no point in going over 32kB; even 16kB is ideal.

Going under 4-8kB sharply increases the number of calls in the reading/writing loop, causing high CPU load and slow data processing in the application.

Go under 4kB only if You know Your messages will usually be smaller than 4kB, or only very rarely over 4kB.

My conclusion:

Based on my experiments, the built-in socket classes/methods/solutions in dotnet are OK, but not efficient at all. My simple Linux C test programs using non-blocking sockets could outperform the fastest, "high-performance" dotnet socket solution (SocketAsyncEventArgs).

This does not mean it is impossible to have fast socket programming in dotnet, but under Windows I had to make my own implementation of Windows IOCP: communicating directly with the Windows kernel via InteropServices/Marshalling, calling Winsock2 methods directly, using a lot of unsafe code to pass the context structs of my connections as pointers between my classes/calls, creating my own ThreadPool, creating I/O event handler threads, and creating my own TaskScheduler to limit the number of simultaneous async calls and avoid pointless context switches.
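The custom TaskScheduler itself is not shown in the post. As a much simpler stand-in for the same idea (capping how many async operations are in flight at once), a SemaphoreSlim-based throttle can be sketched like this; the class name AsyncThrottle is hypothetical and this is a swapped-in technique, not the implementation described above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: at most maxConcurrent operations run at the same time.
public sealed class AsyncThrottle
{
    private readonly SemaphoreSlim _slots;

    public AsyncThrottle(int maxConcurrent) =>
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

    public async Task<T> RunAsync<T>(Func<Task<T>> operation)
    {
        // Waits asynchronously (without blocking a thread) until a slot is free.
        await _slots.WaitAsync().ConfigureAwait(false);
        try
        {
            return await operation().ConfigureAwait(false);
        }
        finally
        {
            _slots.Release(); // free the slot even if the operation throws
        }
    }
}
```

A sensible value for maxConcurrent depends on the machine and workload, so it is something to benchmark rather than fix in advance.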

This was a lot of work, with a lot of research, experimenting and testing. If You want to do it on Your own, do it only if You really think it is worth it. Mixing unsafe/unmanaged code with managed code is a pain in the ass, but in the end it was worth it, because with this solution my own http server could reach about 36000 http requests/sec on a 1Gbit LAN, on Windows 7, with an i7 4790.

This is a level of performance I could never reach with the built-in dotnet sockets.

When running my dotnet server on an i9 7900X on Windows 10, connected to a 4c/8t Intel Atom NAS on Linux via a 10Gbit LAN, I can use the complete bandwidth (copying data at 1GB/s) no matter whether I have 1 or 10000 simultaneous connections.

My socket library also detects whether the code is running on Linux. In that case, instead of Windows IOCP (obviously), it uses Linux kernel calls via InteropServices/Marshalling to create and use sockets and to handle the socket events directly with Linux epoll. This managed to max out the performance of the test machines.

Design tip:

As it turned out, it is difficult to design a networking library from scratch, especially one that is meant to be universal for all purposes. You have to design it either with many settings or specifically for the task You need. This means finding the proper socket buffer sizes, the I/O processing thread count, the worker thread count and the allowed async task count; all of these have to be tuned to the machine the application is running on, to the connection count, and to the type of data You want to transfer over the network. This is why the built-in sockets do not perform that well: they must be universal, and they do not let You set these parameters.

In my case assigning more than 2 dedicated threads to I/O event processing actually makes the overall performance worse, because I am using only 2 RSS queues, and more threads cause more context switching than is ideal.

Choosing wrong buffer sizes will result in performance loss.

Always benchmark different implementations against a simulation of the task You actually need, to find out which solution or setting is best.

Different settings may produce different performance results on different machines and/or operating systems!

Mono vs Dotnet Core:

Since I programmed my socket library in a Framework/Core-compatible way, I could test it under Linux both with mono and with a native Core build. Most interestingly, I could not observe any notable performance difference; both were fast, but of course leaving mono and compiling for Core should be the way to go.

Bonus performance tip:

If Your network card is capable of RSS (Receive Side Scaling), then enable it in Windows in the advanced properties of the network device settings, and raise the RSS queue count from 1 to as high as You can, or as high as gives the best performance.

If Your network card supports it, the queue count is usually set to 1, which makes the kernel process network events on only one CPU core. If You can raise the queue count, the network events will be distributed among more CPU cores, which results in much better performance.

On Linux it is also possible to set this up, but in different ways; it is better to search for information on Your Linux distro/LAN driver.

I hope my experience will help some of You!

answered Sep 18 '22 16:09

beatcoder

I had the same problem. You should take a look at NetCoreServer.

Every thread in the .NET CLR thread pool can handle one task at a time. So to handle more async connects/reads etc., you have to change the thread pool size by using:

ThreadPool.SetMinThreads(Int32, Int32)
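A small usage sketch follows; the wrapper name ThreadPoolTuning is hypothetical and the right thread counts depend on the workload. The first argument is the minimum number of worker threads, the second the minimum number of asynchronous I/O completion threads.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper around ThreadPool.SetMinThreads.
public static class ThreadPoolTuning
{
    // Raises the minimum thread counts without ever lowering the current ones.
    // Returns false if the runtime rejected the values (for example above the maximum).
    public static bool RaiseMinThreads(int workerThreads, int completionPortThreads)
    {
        ThreadPool.GetMinThreads(out int currentWorker, out int currentIo);
        return ThreadPool.SetMinThreads(
            Math.Max(currentWorker, workerThreads),
            Math.Max(currentIo, completionPortThreads));
    }
}
```

Raising the minimum only removes the delay before the pool creates additional threads; it does not change how many operations a single thread can process.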

Using EAP (the event-based asynchronous pattern) is the way to go on Windows. I would use it on Linux too because of the problems you mentioned, and accept the performance hit.

The best would be I/O completion ports on Windows, but they are not portable.

PS: when it comes to serializing objects, you are highly encouraged to use protobuf-net. It serializes objects to binary up to 10x faster than the .NET binary serializer and saves a little space too!

answered Sep 20 '22 16:09

Martin.Martinsson

Tags: c#, linux, sockets, asyncsocket
