 

Linux thread scheduling differences on multi-core systems?

We have several latency-sensitive "pipeline"-style programs that show a measurable performance degradation when run on one Linux kernel versus another. In particular, we see better performance with the 2.6.9 CentOS 4.x (RHEL4) kernel, and worse performance with the 2.6.18 kernel from CentOS 5.x (RHEL5).

By "pipeline" program, I mean one that has multiple threads. The mutiple threads work on shared data. Between each thread, there is a queue. So thread A gets data, pushes into Qab, thread B pulls from Qab, does some processing, then pushes into Qbc, thread C pulls from Qbc, etc. The initial data is from the network (generated by a 3rd party).

We basically measure the time from when the data is received to when the last thread performs its task. In our application, we see an increase of anywhere from 20 to 50 microseconds when moving from CentOS 4 to CentOS 5.
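The measurement itself is nothing fancy: a CLOCK_MONOTONIC timestamp when the data arrives and another when the last stage finishes, along these lines (helper names are made up for the example; on older glibc you may need to link with -lrt):

```c
/* Illustrative end-to-end latency measurement with CLOCK_MONOTONIC.
 * In the real application the two timestamps are taken in different
 * threads; here they simply bracket a placeholder. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void) {
    uint64_t t_rx = now_ns();      /* when data is received from the network */
    /* ... item flows through Qab, Qbc, ... */
    uint64_t t_done = now_ns();    /* when the last thread finishes its task */
    printf("pipeline latency: %.1f us\n", (t_done - t_rx) / 1000.0);
    return 0;
}
```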

I have used a few methods of profiling our application, and determined that the added latency on CentOS 5 comes from queue operations (in particular, popping).

However, I can improve performance on CentOS 5 (to be the same as CentOS 4) by using taskset to bind the program to a subset of the available cores.

So it appears to me that, between CentOS 4 and 5, there was some change (presumably to the kernel) that caused threads to be scheduled differently (and this difference is suboptimal for our application).

While I can "solve" this problem with taskset (or in code via sched_setaffinity()), my preference is to not have to do this. I'm hoping there's some kind of kernel tunable (or maybe collection of tunables) whose default was changed between versions.
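For completeness, the in-code equivalent of taskset that I mean is sched_setaffinity(); a minimal sketch (the CPU set shown, cores 0-3, is only an example):

```c
/* Bind the process to a subset of cores from inside the program, roughly
 * equivalent to `taskset -c 0-3 ./app`. Threads created afterwards inherit
 * the affinity mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 4; cpu++)           /* example: cores 0..3 */
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 == calling process */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... create the pipeline threads as usual ... */
    return 0;
}
```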

Anyone have any experience with this? Perhaps some more areas to investigate?

Update: In this particular case, the issue was resolved by a BIOS update from the server vendor (Dell). I pulled my hair out for quite a while on this one, until I went back to basics and checked my vendor's BIOS updates. Suspiciously, one of the updates said something like "improve performance in maximum performance mode". Once I upgraded the BIOS, CentOS 5 was faster: generally speaking, but particularly in my queue tests and actual production runs.

Asked May 24 '11 by Matt


People also ask

Do multiple threads run on different cores?

Yes, threads and processes can run concurrently on multi-core CPUs, so this works as you describe (regardless of how you create those threads and processes, OpenMP or otherwise). A single process or thread only runs on a single core at a time.

Can user threads run on different cores?

In short: yes, a thread can run on different cores. Not at the same time, of course (it is only one thread of execution), but it could execute on core C0 at time T0, and then on core C1 at time T1.

What thread scheduling does Linux use?

The Linux scheduler supports the SCHED_FIFO scheduling policy defined by POSIX.1-2001. Threads scheduled with this "real-time" policy can be assigned a priority (under Linux) in the range 1..99, with 99 representing the highest priority.
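For example, a thread can request SCHED_FIFO with pthread_setschedparam(); a minimal sketch (the priority value is arbitrary, and raising the policy normally requires root or CAP_SYS_NICE):

```c
/* Give the calling thread the SCHED_FIFO policy at priority 50 (any value
 * in 1..99 is valid). Build with -pthread; run as root or with CAP_SYS_NICE. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 50;

    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err != 0)
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
    else
        printf("running with SCHED_FIFO priority %d\n", sp.sched_priority);
    return 0;
}
```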

Is Linux multi threaded?

In Linux terminology, simultaneous multithreading is also known as SMT or Hyper-Threading. With multithreading enabled, a single core on the hardware is mapped to multiple logical CPUs on Linux. Thus, multiple threads can issue instructions to a core simultaneously during each cycle.


1 Answer

Hmm... if the time taken for a pop() operation from a producer-consumer queue is making a significant difference to the overall performance of your app, I would suggest that the structure of your threads/workflow is not optimal somewhere. Unless there is a huge amount of contention on the queues, I would be surprised if any P-C queue push/pop on any modern OS took more than a µs or so, even if the queue uses kernel locks in a classic 'Computer Science 117: how to make a bounded P-C queue with three semaphores' manner.

Can you just absorb the functionality of the thread(s) that do the least work into those that do the most, reducing the number of push/pop operations per work item that flows through your system?
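For reference, the 'three semaphores' bounded queue mentioned above is the classic construction sketched below; names and sizes are arbitrary, it is only meant to show the shape of the thing:

```c
/* Classic bounded producer-consumer queue built from three semaphores:
 * 'empty' counts free slots, 'full' counts queued items, and a binary
 * 'mutex' semaphore guards the buffer indices. Build with -pthread. */
#include <semaphore.h>
#include <stdio.h>

#define CAP 256

typedef struct {
    long  buf[CAP];
    int   head, tail;
    sem_t empty, full, mutex;
} pcq_t;

static void pcq_init(pcq_t *q) {
    q->head = q->tail = 0;
    sem_init(&q->empty, 0, CAP);   /* CAP free slots initially */
    sem_init(&q->full, 0, 0);      /* no items yet */
    sem_init(&q->mutex, 0, 1);     /* binary semaphore used as a lock */
}

static void pcq_push(pcq_t *q, long v) {
    sem_wait(&q->empty);           /* wait for a free slot */
    sem_wait(&q->mutex);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % CAP;
    sem_post(&q->mutex);
    sem_post(&q->full);            /* one more item available */
}

static long pcq_pop(pcq_t *q) {
    sem_wait(&q->full);            /* wait for an item */
    sem_wait(&q->mutex);
    long v = q->buf[q->head];
    q->head = (q->head + 1) % CAP;
    sem_post(&q->mutex);
    sem_post(&q->empty);           /* one more free slot */
    return v;
}

int main(void) {
    pcq_t q;
    pcq_init(&q);
    pcq_push(&q, 42);
    printf("%ld\n", pcq_pop(&q));
    return 0;
}
```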

Answered Sep 24 '22 by Martin James