Low-latency communication between threads in the same process

Question

A console application has three threads: Main, T1, and T2. The goal is to 'signal' both T1 and T2 (and let them do some work) from the Main thread with the lowest possible latency (μs).

NOTE:

please ignore Jitter, GC etc. (I can handle that)
ElapsedLogger.WriteLine call cost is below 50ns (nano sec)

Have a look at the code below:

sample 1

class Program
{
    private static string msg = string.Empty;
    private static readonly CountdownEvent Countdown = new CountdownEvent(1);

    static void Main(string[] args)
    {
        while (true)
        {
            Countdown.Reset(1);
            var t1 = new Thread(Dowork) { Priority = ThreadPriority.Highest };
            var t2 = new Thread(Dowork) { Priority = ThreadPriority.Highest };
            t1.Start();
            t2.Start();

            Console.WriteLine("Type message and press [enter] to start");
            msg = Console.ReadLine();

            ElapsedLogger.WriteLine("Kick off!");
            Countdown.Signal();

            Thread.Sleep(250);
            ElapsedLogger.FlushToConsole();
        }
    }
    private static void Dowork()
    {
        string t = Thread.CurrentThread.ManagedThreadId.ToString();
        ElapsedLogger.WriteLine("{0} - Waiting...", t);

        Countdown.Wait();

        ElapsedLogger.WriteLine("{0} - Message received: {1}", t, msg);
    }
}

Output:

Type message and press [enter] to start
test3
20141028 12:03:24.230647|5 - Waiting...
20141028 12:03:24.230851|6 - Waiting...
20141028 12:03:30.640351|Kick off!
20141028 12:03:30.640392|5 - Message received: test3
20141028 12:03:30.640394|6 - Message received: test3

Type message and press [enter] to start
test4
20141028 12:03:30.891853|7 - Waiting...
20141028 12:03:30.892072|8 - Waiting...
20141028 12:03:42.024499|Kick off!
20141028 12:03:42.024538|7 - Message received: test4
20141028 12:03:42.024551|8 - Message received: test4

In the above code the 'latency' is around 40-50 μs. The CountdownEvent signaling call is very cheap (less than 50 ns), but the T1 and T2 threads are suspended and it takes time to wake them up.

sample 2

class Program
{
    private static string _msg = string.Empty;
    private static volatile bool _signal = false; // volatile: the spin loop in Dowork must observe Main's write

    static void Main(string[] args)
    {
        while (true)
        {
            _signal = false;
            var t1 = new Thread(Dowork) {Priority = ThreadPriority.Highest};
            var t2 = new Thread(Dowork) {Priority = ThreadPriority.Highest};
            t1.Start();
            t2.Start();

            Console.WriteLine("Type message and press [enter] to start");
            _msg = Console.ReadLine();

            ElapsedLogger.WriteLine("Kick off!");
            _signal = true;

            Thread.Sleep(250);
            ElapsedLogger.FlushToConsole();
        }
    }
    private static void Dowork()
    {
        string t = Thread.CurrentThread.ManagedThreadId.ToString();
        ElapsedLogger.WriteLine("{0} - Waiting...", t);

        while (!_signal) { Thread.SpinWait(10); }

        ElapsedLogger.WriteLine("{0} - Message received: {1}", t, _msg);
    }
}

Output:

Type message and press [enter] to start
testMsg
20141028 11:56:57.829870|5 - Waiting...
20141028 11:56:57.830121|6 - Waiting...
20141028 11:57:05.456075|Kick off!
20141028 11:57:05.456081|6 - Message received: testMsg
20141028 11:57:05.456081|5 - Message received: testMsg

Type message and press [enter] to start
testMsg2
20141028 11:57:05.707528|7 - Waiting...
20141028 11:57:05.707754|8 - Waiting...
20141028 11:57:57.535549|Kick off!
20141028 11:57:57.535576|7 - Message received: testMsg2
20141028 11:57:57.535576|8 - Message received: testMsg2

This time the 'latency' is around 6-7 μs in the first run (27 μs in the second), but CPU usage is high. This is because the T1 and T2 threads are forced to stay active (they do nothing but burn CPU time).

In the 'real' application I cannot spin the CPU like that (I have far too many active threads and it would make things worse/slower or even kill the server).

Is there anything I can use instead to drop the latency to something around 10-15 μs? I guess the Producer/Consumer pattern won't make it any quicker than using CountdownEvent. Wait/Pulse is also more expensive than CountdownEvent.

Is what I got in sample 1 the best I can achieve?

Any suggestions?

I'll try raw sockets as well when I have time.

ZXX · Accepted Answer

You tried to oversimplify this, and then whichever way you turn something is going to bite you. Thread.SpinWait(int) was never meant to be used alone as a blunt instrument. To use it you need to pre-calculate, essentially calibrate (based on the current system info, clock, scheduler interrupt timer interval), the optimal number of iterations for the spin lock. After you exhaust that budget you need to voluntarily sleep/yield/wait. The whole arrangement is usually called a 2-level wait or 2-phase wait.

You need to be aware that once you cross that line, your minimum latency is the scheduler interrupt timer interval (as reported by ClockRes from Sysinternals; at least 1 ms on Win10. If any "measurement" gives you a lower value, either the measurement is broken or you didn't really go to sleep). On Server 2016 the minimum is 12 ms.

How you measure is very important. If you call some kernel functions to measure local/in-process time, that will give you seductively low numbers, but they are not real. If you use QueryPerformanceCounter (the Stopwatch class uses it), the measurement resolution is 1000 real ticks (1/3 μs on a 3 GHz CPU). If you use RDTSC, the nominal resolution is the CPU clock, but that's terribly jittery and gives you the illusion of precision that's not there. These 333 ns are the absolute smallest interval you can measure reliably without VTune or a hardware tracer.
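The resolution quoted above depends on the machine and OS version, so it is worth checking what Stopwatch actually delivers on the box you measure on. A minimal sketch follows; the class and method names (`TimerResolution`, `NanosecondsPerTick`, `SmallestObservedStepTicks`) are made up for illustration, only `Stopwatch.Frequency` and `Stopwatch.GetTimestamp()` are real .NET APIs.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of any library.
public static class TimerResolution
{
    // Nanoseconds represented by one Stopwatch tick on this machine.
    public static double NanosecondsPerTick()
    {
        return 1_000_000_000.0 / Stopwatch.Frequency;
    }

    // Smallest non-zero step seen between consecutive timestamp reads.
    // Returns long.MaxValue if the counter never advanced during sampling.
    public static long SmallestObservedStepTicks(int samples)
    {
        long smallest = long.MaxValue;
        long previous = Stopwatch.GetTimestamp();
        for (int i = 0; i < samples; i++)
        {
            long now = Stopwatch.GetTimestamp();
            long delta = now - previous;
            if (delta > 0 && delta < smallest)
            {
                smallest = delta;
            }
            previous = now;
        }
        return smallest;
    }
}
```

Multiplying `SmallestObservedStepTicks(1000000)` by `NanosecondsPerTick()` gives a rough lower bound on what a single Stopwatch-based measurement can resolve; `Stopwatch.IsHighResolution` tells you whether a high-resolution counter is backing it at all.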

On to Sleepers

Thread.Yield() is the lightest, but with a caveat. On an idle system it's a nop => you are back to a too-tight spinner. On a busy system it's at least the time till the next scheduler interval, which is almost the same as Sleep(0) but without the overhead. Also, it will switch only to a thread that's already scheduled to run on the same core, which means that it has higher chances of degenerating into a nop.

The SpinWait struct is the next lightest. It does its own 2-level wait, but with hard spin and yield, meaning that it still needs a real 2nd level. But it does the counting math for you and will tell you when it's going to yield, which you can take as a signal to go to sleep.
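A rough sketch of that idea: spin with the SpinWait struct, and when `NextSpinWillYield` says the cheap phase is over, fall back to a blocking wait instead of yielding. The `TwoPhaseSignal` class is hypothetical and only illustrates the shape; the spin budget is whatever SpinWait decides, not a value calibrated for your hardware.

```csharp
using System;
using System.Threading;

// Hypothetical two-phase signal: spin first, then block.
public sealed class TwoPhaseSignal
{
    private volatile bool _signaled;

    // spinCount 0: the event itself never spins, the SpinWait loop below does.
    private readonly ManualResetEventSlim _fallback = new ManualResetEventSlim(false, 0);

    public void Set()
    {
        _signaled = true;   // picked up by waiters that are still spinning
        _fallback.Set();    // wakes waiters that already went to sleep
    }

    // Not safe to call while other threads are inside Wait().
    public void Reset()
    {
        _signaled = false;
        _fallback.Reset();
    }

    // Returns true if the signal was caught in the spin phase,
    // false if the thread had to fall back to a blocking wait.
    public bool Wait()
    {
        var spinner = new SpinWait();
        while (!_signaled)
        {
            if (spinner.NextSpinWillYield)
            {
                _fallback.Wait();   // second phase: give the core back
                return false;
            }
            spinner.SpinOnce();     // first phase: short busy-wait
        }
        return true;
    }
}
```

Because the fallback is a manual-reset event, a `Set()` that lands between the last spin and the call to `_fallback.Wait()` is not lost. The return value lets you count how often the expensive second phase was actually hit.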

ManualResetEventSlim is the next lightest and on a busy system it might be faster than yield since it can continue if threads involved didn't go to sleep and their quantum budget is not exhausted.
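For comparison with sample 1 in the question, here is a sketch that swaps CountdownEvent for a ManualResetEventSlim as the gate and measures the wake-up delay with Stopwatch. `SlimGateDemo` and `MeasureWakeupMicroseconds` are made-up names; the `(bool, int spinCount)` constructor is the real one that controls how long `Wait()` spins before it blocks.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical measurement harness, for illustration only.
public static class SlimGateDemo
{
    // Starts `workers` threads that wait on one ManualResetEventSlim and
    // returns, per worker, the microseconds between Set() and the wake-up.
    public static double[] MeasureWakeupMicroseconds(int workers, int spinCount)
    {
        var wokenAt = new long[workers];
        var threads = new Thread[workers];
        long kickOff;

        using (var gate = new ManualResetEventSlim(false, spinCount))
        using (var ready = new CountdownEvent(workers))
        {
            for (int i = 0; i < workers; i++)
            {
                int index = i; // private copy for the closure
                threads[i] = new Thread(() =>
                {
                    ready.Signal();   // tell the caller we are about to wait
                    gate.Wait();      // spins first (per spinCount), then blocks
                    wokenAt[index] = Stopwatch.GetTimestamp();
                });
                threads[i].Start();
            }

            ready.Wait();             // every worker is at (or very near) the gate
            kickOff = Stopwatch.GetTimestamp();
            gate.Set();
            foreach (Thread t in threads)
            {
                t.Join();             // also makes wokenAt visible to this thread
            }
        }

        var micros = new double[workers];
        for (int i = 0; i < workers; i++)
        {
            micros[i] = (wokenAt[i] - kickOff) * 1000000.0 / Stopwatch.Frequency;
        }
        return micros;
    }
}
```

Note that this harness signals almost immediately after the workers arrive, so they are likely still in their spin phase. In the question the signal comes seconds later, after Console.ReadLine(), by which time the workers have long since blocked, and the spin count no longer makes a difference.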

Thread.Sleep(int) is next. Sleep(0) is considered lighter since it doesn't have time evaluation and yields only to threads with the same or higher priority, but for your low-latency purposes it doesn't mean much. Sleep(1) unconditionally yields even to lower-priority threads and has a time-evaluation code path, but the minimal timer slice is 1 ms anyway. Both end up sleeping longer, since on a busy system there are always plenty of threads with the same or higher priority to make sure that it won't have much chance of running in the next slice.

Raising thread priorities to real time level will help only temporarily. Kernel has a defense mechanism that will kick their priorities down after a short run - meaning that you'll need to keep re-raising them every time they run. Windows is not an RTOS.

Any time you go to sleep, via any method, you have to expect at least one time slice delay. Avoiding such delay is exactly the use case for spin locks. Condition variables could be a potential "middle ground" in theory, but since C#/.NET don't have native support for that you'd have to import a DLL and call native functions, and there is no guarantee that they'll be ultra responsive. Immediate wake-up is never guaranteed - even in C++. To do something like that you'd have to hijack an interrupt - impossible in .NET, very hard in C++ and risky.

Using CPU time is actually not bad if your cores are memory bound and starved, which is routinely the case with CPU oversubscription (too many threads for the number of cores) and large in-memory crawlers (indexes, graphs, anything else you keep locked in memory on the GB scale). Then they don't have anything else to do anyway.

If however you are computation intensive (ALU and FPU bound) then spinning can be bad.

Hyperthreading is always bad. Under stress it will heat up cores a lot and lower perf since they are fake pseudo-processors with very little truly independent hardware. Thread.Yield() was more or less invented to lower the pressure from hyperthreading but if you are chasing low latency first rule is - turn hyperthreads off for good.

Also be aware that any measurement for these kinds of things without a hardware tracer or VTune and without careful management of thread-core affinities is pointless. You'll see all kinds of mirages and won't see what's really important - the effect of trashed CPU caches, their latency and memory latency. Plus, you really need a test box that is a replica of what's running live, in production, since a huge number of factors depend on nuances of concrete usage patterns and they are not reproducible on a substantially different configuration.

Reserving Cores

You'll need to reserve a number of cores for exclusive use by your latency critical threads, 1 per core if it's very critical. If you go with 1-1 then plain spinning is perfectly fine. Otherwise yield is perfectly fine. This is the real use-case for SpinWait struct and having that reserved and clean state is the first pre-condition. With 1-1 setup relatively simple measurements become relevant again and even RDTSC becomes smooth enough for regular use.
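.NET has no managed API for pinning a single thread to a core, so on Windows this usually means calling the Win32 function SetThreadAffinityMask through P/Invoke. The sketch below is Windows-only and assumes a machine with at most 64 logical processors (no processor groups); `CorePinning` and `TryPinCurrentThread` are hypothetical names. Pinning by itself does not reserve anything: keeping every other thread and process off that core is a separate job.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical helper; Windows-only, single processor group assumed.
public static class CorePinning
{
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern UIntPtr SetThreadAffinityMask(IntPtr thread, UIntPtr mask);

    // Pins the calling thread to one logical core.
    // Returns false when not running on Windows or when the OS call fails.
    public static bool TryPinCurrentThread(int coreIndex)
    {
        int limit = Math.Min(Environment.ProcessorCount, IntPtr.Size * 8);
        if (coreIndex < 0 || coreIndex >= limit)
        {
            throw new ArgumentOutOfRangeException(nameof(coreIndex));
        }
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            return false;
        }

        // Tell the runtime that this managed thread depends on its OS thread.
        Thread.BeginThreadAffinity();

        var mask = new UIntPtr(1UL << coreIndex);
        UIntPtr previous = SetThreadAffinityMask(GetCurrentThread(), mask);
        return previous != UIntPtr.Zero; // zero means the call failed
    }
}
```

Call `CorePinning.TryPinCurrentThread(n)` as the first statement of the latency-critical thread's body, with a different `n` per thread for the 1-1 setup described above.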

That realm of carefully guarded cores and super-threads can be your own little RTOS, but you need to be very careful and you have to manage everything. You can't go to sleep; if you do, you are back to the scheduler time slice delay.

If your state is very deterministic and you have calculated that N of them have time to run before the usual latency budget is spent, you can go for fibers, and then you control everything.

The number of these super-threads per core depends on what they are doing, whether they are memory bound, how much memory they need, and how many of them can coexist in the same cache without trashing each other's lines. You need to do the math for all 3 caches and be conservative. This is also where VTune or a hardware tracer can help a lot - then you can just run and see.

Oh and the hardware doesn't have to be prohibitively expensive for these things anymore. Ryzen Threadripper with 16 cores can do it just fine.

Ben Voigt · Answer

There's not a whole lot that can be done, since the other thread has to be scheduled by the OS.

Increasing the priority of the waiting thread is the only thing likely to make much difference, and you've already done that. You could go even higher.

If you really need the lowest possible latency for activation of another task, you should turn it into a function that can be called directly from the triggering thread.
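A minimal sketch of what that could look like, assuming the work done by T1 and T2 can be expressed as plain callbacks: the triggering thread invokes them itself, so no wake-up is involved. `InlineDispatcher` is a hypothetical name, not an existing type.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical dispatcher: handlers run on the caller's thread, no signaling.
public sealed class InlineDispatcher
{
    private readonly List<Action<string>> _handlers = new List<Action<string>>();

    // Register handlers before publishing; the list is not thread-safe.
    public void Subscribe(Action<string> handler)
    {
        if (handler == null)
        {
            throw new ArgumentNullException(nameof(handler));
        }
        _handlers.Add(handler);
    }

    // Invokes every handler directly, one after another, on the calling thread.
    public void Publish(string message)
    {
        foreach (Action<string> handler in _handlers)
        {
            handler(message);
        }
    }
}
```

The trade-off is that the handlers run sequentially and the triggering thread is busy until the last one returns, so this only pays off when the work itself is short.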

Tags:

performance

c#

multithreading

low-latency

ipc

Novitzky
