Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Determine syscalls or subsystems a process is spending time waiting in

Tags:

I'm looking for ways to learn which syscalls or which subsystems a process or thread spends time waiting in, i.e. blocked and not scheduled to run on a CPU.

Specifically if I have some unknown process, or a process where all we know is "it's slow" I'd like to be able to learn things like:

  • "it spends 80% of its time in sys_write() on fd 13 which is /some/file"
  • "it's spending a lot of time waiting to read() from a network socket"
  • "it's sleeping in epoll_wait() for activity on fds [4,5,6] which are [file /boo], [socket 10.1.1.:42], [notifyfd blah]"

In other words when my program is not running on the CPU what is it doing?

This is astonishingly hard to answer with perf because it does not appear to have any way to record the duration of a syscall from sys_enter to sys_exit or otherwise keep track of how long an event is. Presumably due to its sampling nature.

I'm aware of some experimental work with eBPF for Linux 4.6 and above that may help, with Brendan Gregg's off-cpu work. But in the sad world of operations and support a 4.6 kernel is a rare unicorn to be treasured.

What are the real world options?

Do ftrace, systemtap etc offer any insights here?

like image 595
Craig Ringer Avatar asked Feb 08 '19 16:02

Craig Ringer


People also ask

What is system call explain all the steps with example?

A system call is a request from computer software to an operating system's kernel. The Application Program Interface (API) connects the operating system's functions to user programs. It acts as a link between the operating system and a process, allowing user-level programs to request operating system services.

How does the CPU respond to system call?

When responding to system calls, other traps/exceptions, and interrupts, OS code is run. The CPU automatically switches to monitor mode whenever an interrupt or trap occurs. So, the OS code is run in monitor mode. Input/output protection: Input/output is protected by making all input/output instructions privileged.

What is system call how system calls are used in application program?

A system call is a routine that allows a user application to request actions that require special privileges. Adding system calls is one of several ways to extend the functions provided by the kernel.

What is Strace command?

strace is a diagnostic tool in Linux. It intercepts and records any syscalls made by a command. Additionally, it also records any Linux signal sent to the process. We can then use this information to debug or diagnose a program. It's especially useful if the source code of the command is not readily available.


1 Answers

You can use strace. First, you might want to get a high-level summary of the costs of each type of system call. You can obtain this summary by running strace -c. For example, one possible output is the following:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.07    0.000263          26        10           getdents
  3.42    0.000010           0      1572           read
  3.42    0.000010           0       762         1 stat
  3.08    0.000009           0      1574         6 open
  0.00    0.000000           0        11           write
  0.00    0.000000           0      1569           close
  0.00    0.000000           0        48           fstat

The % time value is with respect to overall kernel time, not overall execution time (kernel+user). This summary tells you what the most expensive system calls are. However, if you need to determine which specific instances of system calls are most expensive and what arguments are passed to them, you can run strace -i -T. The -i option shows the instruction addresses of the instructions that performed the system call and the -T option the time spent in the system call. An output might look like this:

[00007f97f1b37367] open("myfile", O_RDONLY|O_CLOEXEC) = 3 <0.000020>
[00007f97f1b372f4] fstat(3, {st_mode=S_IFREG|0644, st_size=159776, ...}) = 0 <0.000018>
[00007f97f1b374ba] mmap(NULL, 159776, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f97f1d19000 <0.000019>
[00007f97f1b37467] close(3)             = 0 <0.000018>

The first column shows instruction addresses, the second column shows systems calls with their arguments, the third column shows the returned value, and the last column shows the time spent in that system call. This list is ordered by the dynamic occurrence of the system call. You can filter this output using either grep or the -e option. The instruction addresses can help you locate where in the source code these system calls are made. For example, if a long sequence of system calls have the same address, then there is a good chance that you have a loop somewhere in the code that contains the system call. If your executable binary is not PIE, the dynamic addresses are the same as the static addresses shown by objdump. But even with PIE, the relative order of the dynamic addresses is the same. I don't know if there is an easy way to map these system calls to source code lines.

If you want to find out things like "it spends 80% of its time in sys_write() on fd 13 which is /some/file" then you need to write a script that first extracts the return values of all open calls and the corresponding file name arguments and then sum up the times of all sys_write calls whose fd argument is equal to some value.

like image 178
Hadi Brais Avatar answered Sep 28 '22 19:09

Hadi Brais