I am investigating how to run a process on a dedicated CPU in order to avoid context switches. On my Ubuntu machine, I isolated two CPUs using the kernel parameters "isolcpus=3,7" and "irqaffinity=0-2,4-6". I can confirm that they are correctly taken into account:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.8.0-27-generic root=UUID=58c66f12-0588-442b-9bb8-1d2dd833efe2 ro quiet splash isolcpus=3,7 irqaffinity=0-2,4-6 vt.handoff=7
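For reference, on Ubuntu these parameters are typically added to the kernel command line via GRUB; a sketch, assuming the default GRUB setup (the existing options in your GRUB_CMDLINE_LINUX_DEFAULT line may differ):
$ sudoedit /etc/default/grub
### append the parameters to the existing options, e.g.:
### GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=3,7 irqaffinity=0-2,4-6"
$ sudo update-grub
$ sudo reboot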
After a reboot, I can check that everything works as expected. In a first console, I run:
$ stress -c 24
stress: info: [31717] dispatching hogs: 24 cpu, 0 io, 0 vm, 0 hdd
And on a second one, using "top" I can check the usage of my CPUs:
top - 18:39:07 up 2 days, 20:48, 18 users, load average: 23,15, 10,46, 4,53
Tasks: 457 total, 26 running, 431 sleeping, 0 stopped, 0 zombie
%Cpu0 :100,0 us, 0,0 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu1 : 98,7 us, 1,3 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 99,3 us, 0,7 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 95,7 us, 4,3 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu5 : 98,0 us, 2,0 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu6 : 98,7 us, 1,3 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu7 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 7855176 total, 385736 free, 5891280 used, 1578160 buff/cache
KiB Swap: 15624188 total, 10414520 free, 5209668 used. 626872 avail Mem
CPUs 3 and 7 are idle while the six other ones are fully loaded. Fine.
For the rest of my test, I will use a small application that does almost pure processing:
- It uses two int buffers of the same size
- It reads, one by one, all the values of the first buffer
- Each value is a random index into the second buffer
- It reads the value at that index in the second buffer
- It sums all the values taken from the second buffer
- It repeats all the previous steps for bigger and bigger buffer sizes
- At the end, I print the number of voluntary and involuntary CPU context switches
I am now studying my application when I launch it via the following command lines:
$ ./TestCpuset ### launch on any non-isolated CPU
$ taskset -c 7 ./TestCpuset ### launch on isolated CPU 7
When launched on a non-isolated CPU, the number of context switches ranges from around 20 to... thousands.
When launched on an isolated CPU, the number of context switches is almost constant (between 10 and 20), even if I launch a "stress -c 24" in parallel (which looks quite normal).
But my question is: why isn't it 0, absolutely 0? A context switch happens on a process in order to replace it with another process, but in my case there is no other process to switch to!
My hypothesis is that the "isolcpus" option isolates a CPU from any process (unless the process is given a CPU affinity, as is done with "taskset"), but not from kernel tasks. However, I found no documentation about it.
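One quick way to check this hypothesis (a sketch; the exact set of kernel threads depends on the kernel version) is to list which tasks are currently sitting on the isolated CPU:
$ ps -eo pid,psr,comm | awk '$2 == 7'
### per-CPU kernel threads such as migration/7, ksoftirqd/7 and kworker/7:x
### remain bound to CPU 7 despite isolcpus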
I would appreciate any help in order to reach 0 context switches.
FYI, this question is close to another one I previously opened: Cannot allocate exclusively a CPU for my process
Here is the code of the program I am using:
#include <limits.h>
#include <cstdlib>   // rand()
#include <iostream>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

const unsigned int BUFFER_SIZE = 4096;

using namespace std;

class TimedSumComputer
{
public:
    TimedSumComputer() :
        sum(0),
        bufferSize(0),
        valueBuffer(0),
        indexBuffer(0)
    {}

public:
    virtual ~TimedSumComputer()
    {
        resetBuffers();
    }

public:
    void init(unsigned int bufferSize)
    {
        this->bufferSize = bufferSize;
        resetBuffers();
        initValueBuffer();
        initIndexBuffer();
    }

private:
    void resetBuffers()
    {
        delete [] valueBuffer;
        delete [] indexBuffer;
        valueBuffer = 0;
        indexBuffer = 0;
    }

    void initValueBuffer()
    {
        valueBuffer = new unsigned int[bufferSize];
        for (unsigned int i = 0 ; i < bufferSize ; i++)
        {
            valueBuffer[i] = randomUint();
        }
    }

    static unsigned int randomUint()
    {
        // rand() <= RAND_MAX < UINT_MAX, so the modulo is effectively a no-op
        int value = rand() % UINT_MAX;
        return value;
    }

protected:
    void initIndexBuffer()
    {
        indexBuffer = new unsigned int[bufferSize];
        for (unsigned int i = 0 ; i < bufferSize ; i++)
        {
            indexBuffer[i] = rand() % bufferSize;
        }
    }

public:
    unsigned int getSum() const
    {
        return sum;
    }

    unsigned int computeTimeInMicroSeconds()
    {
        struct timeval startTime, endTime;
        gettimeofday(&startTime, NULL);
        computeSum();   // updates the "sum" member
        gettimeofday(&endTime, NULL);
        return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000)
               + (endTime.tv_usec - startTime.tv_usec);
    }

    unsigned int computeSum()
    {
        sum = 0;
        for (unsigned int i = 0 ; i < bufferSize ; i++)
        {
            unsigned int index = indexBuffer[i];
            sum += valueBuffer[index];
        }
        return sum;
    }

protected:
    unsigned int sum;
    unsigned int bufferSize;
    unsigned int * valueBuffer;
    unsigned int * indexBuffer;
};

unsigned int runTestForBufferSize(TimedSumComputer & timedComputer, unsigned int bufferSize)
{
    timedComputer.init(bufferSize);
    unsigned int timeInMicroSec = timedComputer.computeTimeInMicroSeconds();
    cout << "bufferSize = " << bufferSize << " - time (in micro-sec) = " << timeInMicroSec << endl;
    return timedComputer.getSum();
}

void runTest(TimedSumComputer & timedComputer)
{
    unsigned int result = 0;
    for (unsigned int i = 1 ; i < 10 ; i++)
    {
        result += runTestForBufferSize(timedComputer, BUFFER_SIZE * i);
    }
    unsigned int factor = 1;
    for (unsigned int i = 2 ; i <= 6 ; i++)
    {
        factor *= 10;
        result += runTestForBufferSize(timedComputer, BUFFER_SIZE * factor);
    }
    cout << "result = " << result << endl;
}

void printPid()
{
    cout << "###############################" << endl;
    cout << "Pid = " << getpid() << endl;
    cout << "###############################" << endl;
}

void printNbContextSwitch()
{
    struct rusage usage;
    getrusage(RUSAGE_THREAD, &usage);
    cout << "Number of voluntary context switch: " << usage.ru_nvcsw << endl;
    cout << "Number of involuntary context switch: " << usage.ru_nivcsw << endl;
}

int main()
{
    printPid();
    TimedSumComputer timedComputer;
    runTest(timedComputer);
    printNbContextSwitch();
    return 0;
}
Today, I obtained more clues regarding my problem. I realized that I had to investigate more deeply what was happening in the kernel scheduler. I found these two pages:
I enabled scheduler tracing while my application was running like that:
# sudo bash
# cd /sys/kernel/debug/tracing
# echo 1 > options/function-trace ; echo function_graph > current_tracer ; echo 1 > tracing_on ; echo 0 > tracing_max_latency ; taskset -c 7 [path-to-my-program]/TestCpuset ; echo 0 > tracing_on
# cat trace
As my program was launched on CPU 7 (taskset -c 7), I have to filter the "trace" output:
# grep " 7)" trace
I can then search for transitions from one process to another:
# grep " 7)" trace | grep "=>"
...
7) TestCpu-4753 => kworker-5866
7) kworker-5866 => TestCpu-4753
7) TestCpu-4753 => watchdo-26
7) watchdo-26 => TestCpu-4753
7) TestCpu-4753 => kworker-5866
7) kworker-5866 => TestCpu-4753
7) TestCpu-4753 => kworker-5866
7) kworker-5866 => TestCpu-4753
7) TestCpu-4753 => kworker-5866
7) kworker-5866 => TestCpu-4753
...
Bingo! It seems that the context switches I am tracking are transitions to:
- kworker threads (kernel workqueue workers)
- the watchdog thread
I now have to find:
- what these kworkers and the watchdog are doing on CPU 7
- how to prevent them from running on my isolated CPU
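To identify which work items those kworkers are executing, one possible approach (a sketch; it assumes the workqueue tracepoints are available on the running kernel) is to enable them and filter the trace on CPU 7:
# cd /sys/kernel/debug/tracing
# echo 0 > tracing_on ; echo nop > current_tracer
# echo 1 > events/workqueue/workqueue_execute_start/enable
# echo 1 > tracing_on ; taskset -c 7 [path-to-my-program]/TestCpuset ; echo 0 > tracing_on
# grep "\[007\]" trace
### the "function ..." part of each workqueue_execute_start line names the kernel work being run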
Of course, once again, I would appreciate any help :-P
Potentially any syscall could involve a context switch. When you access paged-out memory it may increase the context switch count too. To reach 0 context switches you would need to force the kernel to keep all the memory your program uses mapped into its address space, and you would need to be sure that none of the syscalls you invoke entails a context switch. I believe it may be possible on kernels with the RT patches, but probably hard to achieve on a standard distro kernel.
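For the memory side of that advice, a minimal sketch of what keeping everything mapped could look like (this is not part of the original TestCpuset program; mlockall() needs a sufficient RLIMIT_MEMLOCK or the CAP_IPC_LOCK capability):
#include <sys/mman.h>
#include <cstdio>

int main()
{
    // Lock all current and future pages into RAM so that touching the
    // buffers never blocks on a major page fault (and hence never yields
    // the CPU while the fault is serviced).
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
    {
        perror("mlockall");
        return 1;
    }
    // ... allocate the buffers and run the measurement here ...
    return 0;
}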
For the sake of those finding this via Google (like me): /sys/devices/virtual/workqueue/cpumask controls where the kernel may queue work items submitted with WORK_CPU_UNBOUND ("don't care which CPU"). As of writing this answer, it is not set to the same mask as the one isolcpus manipulates by default.
Once I changed it so that it no longer includes my isolated CPUs, I saw a significantly smaller (but not zero) number of context switches on my critical threads. I assume that the work items that still ran on my isolated CPUs must have requested them specifically, for example by using schedule_on_each_cpu.
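As an illustration (a sketch; the values assume an 8-CPU machine with CPUs 3 and 7 isolated, so the new mask keeps only CPUs 0-2 and 4-6):
# cat /sys/devices/virtual/workqueue/cpumask
ff
# echo 77 > /sys/devices/virtual/workqueue/cpumask   ### 0x77 = 01110111, i.e. bits 3 and 7 cleared
# cat /sys/devices/virtual/workqueue/cpumask
77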