 

Should I "bind" a "spinning" thread to a certain core?

My application contains several latency-critical threads that "spin", i.e. never block. Each such thread is expected to take 100% of one CPU core. However, modern operating systems seem to often migrate threads from one core to another. So, for example, with this Windows code:

void Processor::ConnectionThread()
{
    while (work)
    {
        Iterate();
    }
}

I do not see a "100% occupied" core in Task Manager; overall system load is 36-40%.

But if I change it to this:

void Processor::ConnectionThread()
{
    SetThreadAffinityMask(GetCurrentThread(), 2);
    while (work)
    {
        Iterate();
    }
}

Then I do see that one of the CPU cores is 100% occupied, and overall system load is reduced to 34-36%.

Does that mean I should prefer SetThreadAffinityMask for "spin" threads? Did I actually improve latency by adding SetThreadAffinityMask in this case? What else should I do for "spin" threads to improve latency?

I'm in the middle of porting my application to Linux, so this question is mostly about Linux, if that matters.

upd: found a slide which shows that binding a busy-waiting thread to a specific CPU may help:


asked Sep 19 '14 by Oleg Vazhnev

2 Answers

Running a thread locked to a single core gives the best latency for that thread in most circumstances, if latency is the most important thing in your code.

The reasons (R) are:

  • your code is likely to be in your iCache
  • the branch predictors are tuned to your code
  • your data is likely to be ready in your dCache
  • the TLB points to your code and data.

Unless

  • You're running an SMT system (e.g. hyperthreading), in which case the evil twin will "help" you by causing your code to be washed out of the iCache, the branch predictors to be tuned to its code, and its data to push yours out of the dCache; your TLB is impacted by its use as well.
    • Cost unknown; each cache miss costs roughly ~4 ns, ~15 ns or ~75 ns for data depending on which level it hits, and this quickly adds up to several thousand ns.
    • Each reason R mentioned above still applies, which offsets some of the cost.
    • If the evil twin also just spins, the cost should be much lower.
  • Or you're allowing interrupts on your core, in which case you get the same problems and:
    • your TLB is flushed
    • you take a 1000 ns-20000 ns hit on each context switch; most should be at the low end if the drivers are well written.
  • Or you allow the OS to switch your process out, in which case you have the same problems as with interrupts, just at the high end of the range.
    • Being switched out could also pause the thread for one or more entire time slices, as it can only run on one (or two) hardware threads.
  • Or you use any system calls that cause context switches.
    • No disk IO at all.
    • Only async IO otherwise.
  • Having more active (non-paused) threads than cores increases the likelihood of problems.

So if you need less than 100 ns latency to keep your application from exploding, you need to prevent or lessen the impact of SMT, interrupts and task switching on your core. The perfect solution would be a real-time operating system with static scheduling. This is a nearly perfect match for your target, but it's a new world if you have mostly done server and desktop programming.

The disadvantages of locking a thread to a single core are:

  • It will cost some total throughput.
    • Some threads that might otherwise have run get no chance to, since the context cannot be switched.
    • But latency is more important in this case.
  • If the thread gets context switched out, it will take some time before it can be scheduled again, potentially one or more time slices, typically 10-16 ms, which is unacceptable in this application.
    • Locking it to a core and its SMT will lessen this problem, but not eliminate it. Each added core will lessen the problem.
    • Setting its priority higher will lessen the problem, but not eliminate it.
    • Scheduling with SCHED_FIFO and the highest priority will prevent most context switches; interrupts can still cause temporary switches, as do some system calls.
    • If you have a multi-CPU setup, you might be able to take exclusive ownership of one of the CPUs through cpuset. This prevents other applications from using it.

Using pthread_setschedparam with SCHED_FIFO and the highest priority, running as root, and locking the thread to the core and its evil twin should secure the best latency of all of these; only a real-time operating system can eliminate all context switches.

Other links:

Discussion on interrupts.

Your Linux might accept a call to sched_setscheduler using SCHED_FIFO, but this requires that you pass your own PID, not just a TID, or that your threads cooperate in multitasking.
This might not be ideal, as all your threads would then only be switched out "voluntarily", removing the kernel's flexibility to schedule them.

Interprocess communication in 100ns

answered Oct 16 '22 by Surt


Pinning a task to a specific processor will generally give better performance for that task. But there are a lot of nuances and costs to consider when doing so.

When you force affinity, you restrict the operating system's scheduling choices and increase CPU contention for the remaining tasks, so EVERYTHING else on the system is impacted, including the operating system itself. You also need to consider that if tasks need to communicate through memory, and their affinities are set to CPUs that don't share a cache, you can drastically increase the latency of communication between tasks.

One of the biggest reasons setting task CPU affinity is beneficial, though, is that it gives more predictable cache and TLB (translation lookaside buffer) behavior. When a task switches CPUs, the operating system may move it to a CPU that has no access to the last CPU's cache or TLB, which increases cache misses for the task. It's particularly an issue when communicating across tasks, as it takes more time to communicate across higher-level caches and, worst of all, main memory. To measure cache statistics on Linux (and performance in general), I recommend using perf.

The best suggestion is really to measure before you try to fix affinities. A good way to quantify latency is the rdtsc instruction (at least on x86). This reads the CPU's time stamp counter, which generally gives the highest precision; measuring across events gives roughly nanosecond accuracy.

#include <cstdint>

static inline uint64_t rdtsc() {
   uint32_t eax, edx;
   // rdtsc puts the low 32 bits of the time stamp counter in EAX
   // and the high 32 bits in EDX
   asm volatile ("rdtsc" : "=a"(eax), "=d"(edx));
   return ((uint64_t) edx << 32) | (uint64_t) eax;
}
  • note - the rdtsc instruction needs to be combined with a load fence (lfence) to ensure all previous instructions have completed, or use rdtscp instead
  • also note - if rdtsc is used without an invariant time source (on linux, check with grep constant_tsc /proc/cpuinfo), you may get unreliable values across frequency changes and if the task switches cpu (time source)

So, in general, yes, setting the affinity does give lower latency, but this is not always true, and there are very serious costs when you do it.

Some additional reading...

  • Intel 64 Architecture Processor Topology Enumeration
  • What Every Programmer Needs to Know About Memory (Parts 2, 3, 4, 6, and 7)
  • Intel Software Developer Reference (Vol. 2A/2B)
  • Acquire and Release Fences
  • TCMalloc
answered Oct 16 '22 by Jason