Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Impossible Cpu time for process / thread with rusage

I am calculating the delta between the cpu time a process / thread is using, between the calling of a certain function.

I am getting impossible value for the process cpu time, some time's the next call is returning smaller cpu time, which is impossible.

this is the code i am using to calculate the cpu time:

  u64 CpuTime::calculateCpuTime(bool a_thread)
  {
    struct rusage l_rusage;
    int retVal;
    if(a_thread)
    {
      retVal = getrusage(1, &l_rusage); //1 = RUSAGE_THREAD
    }
    else
    {
      retVal = getrusage(0, &l_rusage); //0 = RUSAGE_SELF
    }
    ASSERT(retVal==0);

    u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+
      (static_cast<u64>(l_rusage.ru_utime.tv_usec));

    u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*1000000)+
      (static_cast<u64>(l_rusage.ru_stime.tv_usec));

    if(a_thread)
    {
      return userUSeconds + systemUSeconds;
    }
    return (userUSeconds + systemUSeconds) / ACE_OS::num_processors_online();
  }

Also the function calculating retrieving this is async, no more then 1 thread can access it at a time.

this is an example of the impossible output (process):

2016/03/23 13:39:05.187 - #2 - 15471/15055 - ThreadDeltaCpuTime = 6000 ThreadCpuTime = 46756013 ProcessCpuTime = 39820759 deltaProcessCpuTime= 250

2016/03/23 13:39:05.386 - #2 - 15470/15055 - ThreadDeltaCpuTime = 5000 ThreadCpuTime = 46663013 ProcessCpuTime = 39699759 deltaProcessCpuTime = 18446744073709430616

The delta calculation is a simple newValue - oldValue, with a unsigned int, it leads to those numbers.

Example for a thread impossible output:

2016/03/23 13:38:59.183 - #2 - 15472/15055 - ThreadDeltaCpuTime = 6000 ThreadCpuTime = 47541012 ProcessCpuTime = 39658134 deltaProcessCpuTime = 5250

2016/03/23 13:38:59.382 - #2 - 15472/15055 - ThreadDeltaCpuTime = 1844 6744073708557617 ThreadCpuTime = 46547013 ProcessCpuTime = 39663009 deltaProcessCpuTime = 4625

i have bolded the thread id, also you can see the function has been executed 200ms after the first 1.

I am looking for a clue for why this is happening, i have tried using the ACE_OS rusage as well, with same results (make sense), i could solve this by ignoring "bad" values, but i am wondering why this is happening.

EDIT:

This is where i do the actual printing:

  u64 computeCpuTime()
  {
    u64 deltaCpuTime= CpuTime::getThreadDeltaCpuTime();
    u64 CpuTime= getThreadCpuTime();
    u64 ProcessCpuTime= getProcessCpuTime();
    u64 deltaProcessCpuTime= CpuTime::getProcessDeltaCpuTime();
    Log<<value(deltaCpuTime)<<value(CpuTime)<<value(ProcessCpuTime)<<value(deltaProcessCpuTime)<<endlog;
    return deltaCpuTime;
  }

And the rest of the functions used in this calculation:

  u64 CpuTime::getThreadDeltaCpuTime()
  {
    pid_t thisThread = (pid_t) syscall (SYS_gettid);
    u64& val = m_threadsCpuMap[thisThread];
    u64 oldValue =val;
    val = calculateCpuTime(true);
    return val - oldValue;
  }

  u64 CpuTime::getProcessDeltaCpuTime()
  {
    u64 oldValue = m_processCpu;
    m_processCpu = calculateCpuTime(false);

    return m_processCpu - oldValue;
  }

  u64 getThreadCpuTime()
  {
    return CpuTime::calculateCpuTime(true);
  }

  u64 getProcessCpuTime()
  {
    return CpuTime::calculateCpuTime(false);
  }

The object doing the system call ("m_cpuTime") is a singleton, protected with a mutex, only 1 thread can access it at a time.

it holds the map of the threads cpu usage, m_threadsCpuMap (used for the delta), and also the last process cpu usage, m_processCpu.

ANOTHER EDIT:

I have simplified it with a simple test, working with a single thread, removed the division by active cpu's, and am only checking the Process Cpu time, still with impossible results.

Here is the updated functions:

Test:

for(int i = 0; i < 100000 ; i++)
{
  for(int k = 0; k < 1000000; k++)
    m = k % i;
    cpuTime = CpuTime::instance()->getProcessDeltaCpuTime();
}

Functions used in test:

  u64 CpuTime::getProcessDeltaCpuTime()
  {
    u64 oldValue = m_processCpu;
    m_processCpu = calculateCpuTime(eThisProcess);
   Log<<value(oldValue)<<value(m_processCpu)<value( m_processCpu - oldValue)<<endlog;
    return m_processCpu - oldValue;
  }

  u64 CpuTime::calculateCpuTime(int a_type)
  {
    struct rusage l_rusage;
    int retVal;
    if(a_type == eThisThread)
    {
      retVal = /*ACE_OS::*/getrusage(1, &l_rusage);
    }
    else
    {
      retVal = /*ACE_OS::*/getrusage(0, &l_rusage);
    }

    u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+
      (static_cast<u64>(l_rusage.ru_utime.tv_usec));

    u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*100000)+
      (static_cast<u64>(l_rusage.ru_stime.tv_usec));

    if(a_type == eThisThread)
    {
      return userUSeconds + systemUSeconds;
    }
    return (userUSeconds + systemUSeconds)/* / ACE_OS::num_processors_online()*/;

This is an example of the logs, i have removed all the times the delta was ~0

2016/03/29 08:07:05.198 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14797750 m_processCpu = 14798749 m_process Cpu - oldValue = 999

2016/03/29 08:07:05.199 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14798749 m_processCpu = 14799749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.200 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14799749 m_processCpu = 14800749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.201 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14800749 m_processCpu = 14801749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.202 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14801749 m_processCpu = 14802749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.203 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14802749 m_processCpu = 13903748 m_process Cpu - oldValue = 18446744073708652615

2016/03/29 08:07:05.204 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 13903748 m_processCpu = 13904748 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.205 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 13904748 m_processCpu = 13905748 m_process Cpu - oldValue = 1000

This time there was only 1 time of bad values, out of 100000 iterations. And after it all the following calculation were also smaller, e.g. it went like this: 1 2 3 4 5 6 7 8 9 10 bad value, 6 7 8 9 10... As if the process lost count, went back, and then continued normally.

YET ANOUTHER EDIT:

I have separated the code, completely, compiled it as simple as possible. and the problem did not occur, it could be a matter of probability (it was 1/100000 before). this is the code:

int main()
{
  long int n = 0;
  long int oldValue = 0;
  long int newValue = 0;
  unsigned long int deltaValue = 0;

  for(int i = 0; i < 1000000; i++)
  {
    for(long int m = 0; m <10000; m++)
      n = m + i;
    struct rusage l_rusage;
    int retVal;
    retVal = getrusage(0, &l_rusage);
    if(retVal != 0) return 0;

    long int userUSeconds = l_rusage.ru_utime.tv_sec*1000000 + l_rusage.ru_utime.tv_usec;

    long int systemUSeconds = l_rusage.ru_stime.tv_sec*100000 + l_rusage.ru_stime.tv_usec;
    oldValue = newValue;
    newValue = userUSeconds + systemUSeconds;
    deltaValue = newValue - oldValue;
    if(deltaValue != 0)
      std::cout<<"Old value: "<< oldValue <<" New Value: "<< newValue <<" Delta value: "<<deltaValue<<"\n";


  }
  std::cout<<n;
  return 0;
}

And from further investigating the earlier test, (which use our unittest framework), the problem occurs always at around the same CPU time. Which mean this is happening for a reason, i just cant figure out what happens at that exact time which could cause some thing like this. Or how some thing like this could be caused.

I am not familiar enough with linux kernel to understand how this calculation is made.

Another insight - when using clock() to get the process. it does not happen, even more - it seems like until that moment the values were around the same, after this point they are not:

2016/03/29 12:36:19.158 - #2 - 20544/20544 - CpuTime::getProcessDeltaCpuTime: oldValue = 14598780 m_processCpu = 14598780 m_processCpu - oldValue = 0 oldClockTime = 14580000 m_clockTime = 14580000 m_clockTime - oldClockTime = 0

2016/03/29 12:36:19.158 - #2 - 20544/20544 - CpuTime::getProcessDeltaCpuTime: oldValue = 14598780 m_processCpu = 13699779 m_processCpu - oldValue = 18446744073708652615 oldClockTime = 14580000 m_clockTime = 14590000 m_clockTime - oldClockTime = 10000

2016/03/29 12:36:19.158 - #2 - 20544/20544 - CpuTime::getProcessDeltaCpuTime: oldValue = 13699779 m_processCpu = 13699779 m_processCpu - oldValue = 0 oldClockTime = 14590000 m_clockTime = 14590000 m_clockTime - oldClockTime = 0

Which leads me to a different question, how is rusage calculated, and how is clock()? what could lead to the difference between them?

I have worked around the problem, measuring it using clock_gettime, here is the code if any 1 is interested:

  u64 CpuMeasure::calculateCpuTime(int a_type)
  {
    struct timespec ts;

    if(a_type == eThisThread)
    {
      if (ACE_OS::clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) == 0)
      {
         return (u64)ts.tv_sec * 1000000 + (u64)ts.tv_nsec / 1000; //From nano to mili seconds
      }
    }
    else
    {
      if (ACE_OS::clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) == 0)
      {
         return (u64)ts.tv_sec * 1000000 + (u64)ts.tv_nsec / 1000;
      }
    }
    ASSERT(false); //we cant be here, unless some thing bad happened
    return 0;
}

I am using ACE_OS for portability, but it works with the default functions from sys/time.h

Yet, i am still in question about rusage strange behavior , what could cause it to give those values.

like image 809
Dinari Avatar asked Mar 23 '16 14:03

Dinari


1 Answers

Check carefully these lines: you are multiplying the system time seconds by 100 thousand instead of 1 million:

u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+
  (static_cast<u64>(l_rusage.ru_utime.tv_usec));

u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*100000)+
  (static_cast<u64>(l_rusage.ru_stime.tv_usec));

They are there in two of three examples you have shown (not in the first one).

However this would explain the odd sequence:

2016/03/29 08:07:05.201 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14800749 m_processCpu = 14801749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.202 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14801749 m_processCpu = 14802749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.203 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14802749 m_processCpu = 13903748 m_process Cpu - oldValue = 18446744073708652615

2016/03/29 08:07:05.204 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 13903748 m_processCpu = 13904748 m_process Cpu - oldValue = 1000

at every iteration it adds 1000 us. But I guess that as soon as the system time surpass one second, only 1/10 of second is added, resulting in -0.9 seconds shift that is exactly 14.8 -> 13.9 seconds.

in fact:

 18446744073708652615 - 2^64 + 1 = -899000

that is -900000 (0.9 seconds) + 1000 us

like image 115
Sigi Avatar answered Nov 08 '22 23:11

Sigi