How to prioritize (or set scheduling policy for) the 'manager' and 'worker' threads of a process?

I'm running a process (on a Linux 3.x-based OS) in which:

  • A few threads are 'manager' threads (for simplicity, assume they make decisions about which worker threads should do what; they do no I/O, and the total CPU time they need is shorter, perhaps much shorter, than the worker threads')
  • More threads are 'worker' threads: they do the heavy lifting computation-wise, and I have no problem with their being preempted at any time.

It's possible that there's oversubscription (i.e. more worker threads than twice the number of cores on an Intel processor with HyperThreading). Now, what I'm seeing is that the 'manager' threads don't get processor time frequently enough. They're not entirely 'starved'; I just want to give them a boost. So, naturally, I thought about setting different thread priorities (I'm on Linux), but then I noticed the different choices of thread schedulers and their effects. At this point I got confused, or rather, it's not clear to me:

  • Which scheduling policy should I choose for the managers, and which for the workers?
  • What should I set the thread priorities to (if at all)?
  • Do I need to have my threads yield() occasionally?

Notes:

  • I'm intentionally not saying anything about the language or thread pool mechanism. I want to ask this question in the more general setting.
  • Please do not make assumptions about CPU cores. There may be many of them, or maybe just one, and perhaps I need workers (or workers and managers) on each core.
  • The worker threads may or may not do I/O. Answers for the case of them not doing any I/O are welcome, though.
  • I don't really need the system to be very responsive other than running my application. I mean, I'd rather be able to SSH in there and have my typing echoed to me without a significant delay, but no real restrictions there.
asked Jan 18 '15 by einpoklum

2 Answers

UPD 12.02.2015: I have run some experiments.

Theory

There is an obvious solution: change the "manager" threads to a real-time scheduling policy (SCHED_DEADLINE/SCHED_FIFO). In this case the "manager" threads will always have a larger priority than most threads in the system, so they will almost always get the CPU when they need it.

However, there is another solution that lets you stay on the CFS scheduler. Your description of the purpose of the "worker" threads is similar to batch scheduling (in ancient times, when computers were large, a user had to put his job onto a queue and wait hours until it was done). Linux CFS supports batch jobs via the SCHED_BATCH policy and interactive jobs via the SCHED_NORMAL policy.
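A sketch of the CFS-only variant (the helper name is my own). On Linux, sched_setscheduler() with pid 0 affects only the calling thread, and demoting oneself to SCHED_BATCH needs no privileges:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Demote the calling thread to SCHED_BATCH; CFS then treats it as a
 * CPU-bound batch job that does not preempt SCHED_NORMAL tasks on wakeup.
 * Returns 0 on success, -1 on error. */
static int make_worker_batch(void)
{
    struct sched_param sp = { .sched_priority = 0 }; /* must be 0 for SCHED_BATCH */
    return sched_setscheduler(0, SCHED_BATCH, &sp);  /* pid 0 = calling thread */
}
```

Each worker thread calls this on itself when it starts; the managers keep the default SCHED_NORMAL policy.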

There is also a useful comment in the kernel code (kernel/sched/fair.c):

/*
 * Batch and idle tasks do not preempt non-idle tasks (their preemption
 * is driven by the tick):
 */
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
    return;

So when a "manager" thread or some other event wakes a "worker", the latter will get the CPU only if there are free CPUs in the system or when the "manager" exhausts its timeslice (to tune that, change the task's weight).
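One way to change a thread's weight is through its nice value. A sketch (the helper name is mine), relying on Linux's per-thread semantics of setpriority(2), where PRIO_PROCESS with a thread ID renices only that thread:

```c
#define _GNU_SOURCE
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raise the calling thread's nice value, lowering its CFS weight.
 * Increasing nice needs no privileges; returns 0 on success. */
static int renice_self(int nice_val)
{
    pid_t tid = (pid_t)syscall(SYS_gettid); /* glibc gained a gettid() wrapper only in 2.30 */
    return setpriority(PRIO_PROCESS, tid, nice_val);
}
```

Workers would call, say, renice_self(10) so they exhaust their vruntime bonus faster relative to the managers.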

It seems that your problem can't be solved without changing scheduler policies. If the "worker" threads are very busy and the "manager" threads rarely wake up, they would get the same vruntime bonus, so a "worker" would always preempt a "manager" thread (but you may increase the workers' weight, so they exhaust their bonus faster).

Experiment

I have a server with 2 x Intel Xeon E5-2420 CPUs, which gives us 24 hardware threads. To simulate two thread pools I used my own TSLoad workload generator (and fixed a couple of bugs while running the experiments :) ).

There were two thread pools: tp_manager with 4 threads and tp_worker with 30 threads, both running busy_wait workloads (just for(i = 0; i < N; ++i);) but with different numbers of loop cycles. tp_worker works in benchmark mode, so it runs as many requests as it can and occupies 100% of the CPU.

Here is a sample config: https://gist.github.com/myaut/ad946e89cb56b0d4acde

3.12 (vanilla with debug config)

EXP  |              MANAGER              |     WORKER
     |  sched            wait    service | sched            service
     |  policy           time     time   | policy            time
33   |  NORMAL          0.045    2.620   |     WAS NOT RUNNING
34   |  NORMAL          0.131    4.007   | NORMAL           125.192
35   |  NORMAL          0.123    4.007   | BATCH            125.143
36   |  NORMAL          0.026    4.007   | BATCH (nice=10)  125.296
37   |  NORMAL          0.025    3.978   | BATCH (nice=19)  125.223
38   |  FIFO (prio=9)  -0.022    3.991   | NORMAL           125.187
39   |  core:0:0        0.037    2.929   | !core:0:0        136.719

3.2 (stock Debian)

EXP  |              MANAGER              |     WORKER
     |  sched            wait    service | sched            service
     |  policy           time     time   | policy            time
46   |  NORMAL          0.032    2.589   |     WAS NOT RUNNING
45   |  NORMAL          0.081    4.001   | NORMAL           125.140
47   |  NORMAL          0.048    3.998   | BATCH            125.205
50   |  NORMAL          0.023    3.994   | BATCH (nice=10)  125.202
48   |  NORMAL          0.033    3.996   | BATCH (nice=19)  125.223
42   |  FIFO (prio=9)  -0.008    4.016   | NORMAL           125.110
39   |  core:0:0        0.035    2.930   | !core:0:0        135.990

Some notes:

  • All times are in milliseconds
  • The last experiment is for setting affinities (advised by @PhilippClaßen): manager threads were bound to Core #0 while worker threads were bound to all cores except Core #0.
  • Service time for manager threads increased two-fold, which is explainable by concurrency inside cores (the processor has Hyper-Threading!)
  • Using SCHED_BATCH + nice (TSLoad cannot set the weight directly, but nice can do it indirectly) slightly reduces wait time.
  • Negative wait time in the SCHED_FIFO experiment is OK: TSLoad reserves 30us so it can do preliminary work / the scheduler has time to do a context switch / etc. It seems that SCHED_FIFO is very fast.
  • Reserving a single core isn't that bad: because it removed in-core concurrency, service time decreased significantly.
answered Oct 05 '22 by myaut

In addition to myaut's answer, you could also bind the managers to specific CPUs (sched_setaffinity) and the workers to the rest. Depending on your exact use case, that can be very wasteful, of course.

Link: Thread binding the CPU core
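A sketch of that binding (the core numbers and helper names are illustrative): pin a manager to core 0, and a worker to every other online core.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

/* Pin the calling thread to a single core. Returns 0 on success. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Pin the calling thread to all online cores except one. */
static int pin_away_from_core(int excluded)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    for (long c = 0; c < ncpu; ++c)
        if (c != excluded)
            CPU_SET(c, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Managers would call pin_to_core(0) and workers pin_away_from_core(0); note that on a single-core machine the worker call would produce an empty set and fail.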

Explicit yielding is generally not necessary, in fact often discouraged. To quote Robert Love in "Linux System Programming":

In practice, there are few legitimate uses of sched_yield() on a proper preemptive multitasking system such as Linux. The kernel is fully capable of making the optimal and most efficient scheduling decisions - certainly, the kernel is better equipped than an individual application to decide what to preempt and when.

The exception that he mentions is when you are waiting on external events, for example caused by the user, by hardware, or by another process. That is not the case in your example.

answered Oct 05 '22 by Philipp Claßen