How would I follow a system call from the trap into the kernel, through how arguments are passed and how the system call is located in the kernel, to the actual processing of the system call, and finally to the return to user space and how state is restored?
The tracing tools on Linux are strace and ltrace. Run man strace to see the full set of available options. The strace tool traces system calls. You can either attach it to a process that is already running, or use it to start a new process.
strace is a Linux utility that lets you trace the system calls that a given application makes.
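As a rough sketch of the two modes mentioned above (starting a new process under strace versus attaching to a running one), something like the following could be used. It assumes strace is installed and that ptrace is permitted in your environment; the traced command `true` and the PID 12345 are placeholders, not anything from the original answers.

```shell
#!/bin/sh
# Mode 1: start a new process under strace and write the trace to a file.
log=$(mktemp)
if command -v strace >/dev/null 2>&1 && strace -o "$log" true 2>/dev/null; then
    # Each log line is one system call: name(arguments) = return value.
    # The first call of a traced program is the execve that starts it.
    first_call=$(head -n 1 "$log" | cut -d'(' -f1)
else
    # strace is missing, or ptrace is restricted in this environment.
    first_call="unavailable"
fi
echo "first system call: $first_call"
rm -f "$log"

# Mode 2: attach to a process that is already running.
# 12345 is a placeholder PID; attaching usually needs root or a matching UID.
# sudo strace -p 12345
```

Writing the trace to a file with -o keeps it separate from the traced program's own output.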
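On 32-bit x86, the number of the system call being invoked is stored in the EAX register, and its arguments are held in other processor registers (EBX, ECX, EDX, ESI, EDI, EBP). On x86-64, the number goes in RAX and the arguments in RDI, RSI, RDX, R10, R8 and R9.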
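To make the register convention concrete, here is a minimal sketch that assembles a freestanding program from a shell script. It is not from the original answer and it makes several assumptions: an x86-64 Linux machine, GNU as and ld installed, and the x86-64 convention, where RAX (the 64-bit counterpart of EAX) holds the system call number and RDI, RSI and RDX hold the first three arguments.

```shell
#!/bin/sh
# Sketch only: assumes x86-64 Linux with GNU as and ld (binutils) installed.
tmp=$(mktemp -d)
cat > "$tmp/hello.s" <<'EOF'
    .globl _start
    .text
_start:
    mov $1, %rax            # system call number 1 = write
    mov $1, %rdi            # arg 1: file descriptor (stdout)
    lea msg(%rip), %rsi     # arg 2: buffer address
    mov $6, %rdx            # arg 3: byte count
    syscall                 # trap into the kernel
    mov $60, %rax           # system call number 60 = exit
    xor %edi, %edi          # arg 1: exit status 0
    syscall
msg:
    .ascii "hello\n"
EOF
if [ "$(uname -s)" = "Linux" ] && [ "$(uname -m)" = "x86_64" ] &&
   command -v as >/dev/null 2>&1 && command -v ld >/dev/null 2>&1 &&
   as -o "$tmp/hello.o" "$tmp/hello.s" 2>/dev/null &&
   ld -o "$tmp/hello" "$tmp/hello.o" 2>/dev/null; then
    # Run the binary; fall back to "skipped" if it cannot be executed here.
    result=$("$tmp/hello" 2>/dev/null) || result="skipped"
else
    result="skipped"
fi
echo "$result"
rm -rf "$tmp"
```

If you run the binary under strace before the cleanup step, the write and exit calls should appear with the same argument values that were loaded into the registers.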
SystemTap
This is the most powerful method I've found so far. It can even show the call arguments, see: Does ftrace allow capture of system call arguments to the Linux kernel, or only function names?
Usage:
sudo apt-get install systemtap
sudo stap -e 'probe syscall.mkdir { printf("%s[%d] -> %s(%s)\n", execname(), pid(), name, argstr) }'
Then on another terminal:
sudo rm -rf /tmp/a /tmp/b
mkdir /tmp/a
mkdir /tmp/b
Sample output:
mkdir[4590] -> mkdir("/tmp/a", 0777)
mkdir[4593] -> mkdir("/tmp/b", 0777)
Documentation: https://sourceware.org/systemtap/documentation.html
Seems to be kprobes based: https://sourceware.org/systemtap/archpaper.pdf
See also: How to trace just system call events with ftrace without showing any other functions in the Linux kernel?
Tested on Ubuntu 18.04, Linux kernel 4.15.
ltrace -S shows both system calls and library calls
This awesome tool therefore gives even further visibility into what executables are doing.
Here, for example, I used it to analyze what system calls dlopen is making: https://unix.stackexchange.com/questions/226524/what-system-call-is-used-to-load-libraries-in-linux/462710#462710
ftrace
minimal runnable example
Mentioned at https://stackoverflow.com/a/29840482/895245 but here goes a minimal runnable example.
Run with sudo:
#!/bin/sh
set -eux
d=debug/tracing
mkdir -p debug
if ! mountpoint -q debug; then
mount -t debugfs nodev debug
fi
# Stop tracing.
echo 0 > "${d}/tracing_on"
# Clear previous traces.
echo > "${d}/trace"
# Find the tracer name.
cat "${d}/available_tracers"
# Disable tracing functions, show only system call events.
echo nop > "${d}/current_tracer"
# Find the event name.
grep mkdir "${d}/available_events"
# Enable tracing mkdir.
# Both statements below seem to do the exact same thing,
# just with different interfaces.
# https://www.kernel.org/doc/html/v4.18/trace/events.html
echo sys_enter_mkdir > "${d}/set_event"
# echo 1 > "${d}/events/syscalls/sys_enter_mkdir/enable"
# Start tracing.
echo 1 > "${d}/tracing_on"
# Generate two mkdir calls by two different processes.
rm -rf /tmp/a /tmp/b
mkdir /tmp/a
mkdir /tmp/b
# View the trace.
cat "${d}/trace"
# Stop tracing.
echo 0 > "${d}/tracing_on"
umount debug
Sample output:
# tracer: nop
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
mkdir-5619 [005] .... 10249.262531: sys_mkdir(pathname: 7fff93cbfcb0, mode: 1ff)
mkdir-5620 [003] .... 10249.264613: sys_mkdir(pathname: 7ffcdc91ecb0, mode: 1ff)
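Note that this output prints pathname as a raw user-space pointer rather than the string, and mode in hexadecimal. The value 1ff is the same mode that SystemTap printed as 0777 in octal, which can be checked with a quick conversion in the shell:

```shell
# Convert the hexadecimal mode reported by ftrace to octal.
mode_octal=$(printf '%o' 0x1ff)
echo "$mode_octal"   # prints 777
```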
One cool thing about this method is that it shows the system call for all processes on the system at once, although you can also filter PIDs of interest with set_ftrace_pid (for the function tracers) or set_event_pid (for trace events such as the one used here).
Documentation at: https://www.kernel.org/doc/html/v4.18/trace/index.html
Tested on Ubuntu 18.04, Linux kernel 4.15.
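To see which argument fields the sys_enter_mkdir event records, each trace event also has a format file. The sketch below assumes tracefs is mounted at one of the two usual locations, /sys/kernel/tracing or /sys/kernel/debug/tracing, and that you have permission to read it (usually root). If you mounted debugfs somewhere else, as the script above does, adjust the path accordingly.

```shell
#!/bin/sh
# Look for the event's format description in the usual tracefs locations.
found=no
for d in /sys/kernel/tracing /sys/kernel/debug/tracing; do
    f="$d/events/syscalls/sys_enter_mkdir/format"
    if [ -r "$f" ]; then
        # Lists the event's fields, including the pathname and mode arguments.
        cat "$f"
        found=yes
        break
    fi
done
[ "$found" = "yes" ] || echo "format file not readable (not root, or tracefs not mounted)"
```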
GDB step debug the Linux kernel
Depending on the level of internals detail you need, this is an option: How to debug the Linux kernel with GDB and QEMU?
strace
minimal runnable example
Here is a minimal runnable example of strace: How should strace be used? It uses a freestanding hello world, which makes how everything works perfectly clear.
More info
https://en.pingcap.com/blog/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance might be worth a read. It mentions:
perf top -F 49 -e raw_syscalls:sys_enter --sort comm,dso --show-nr-samples
and the BPF-based traceloop: https://github.com/kinvolk/traceloop which the article claims to be a very fast method:
sudo -E ./traceloop cgroups --dump-on-exit /sys/fs/cgroup/system.slice/sshd.service
It's actually relatively easy to use ftrace. Here's a classic article by Steven, "Mr. ftrace", Rostedt. The second part is here.
There is a free video by Jan-Simon Möller of the Linux Foundation, and many other good introductory articles that you can find using search terms such as "ftrace tutorial" or "ftrace example".
You can use the -f and -ff options. Something like this:
strace -f -e trace=process bash -c 'ls; :'
-f Trace child processes as they are created by currently traced processes as a result of the fork(2) system call.
-ff If the -o filename option is in effect, each process's trace is written to filename.pid, where pid is the numeric process ID of each process. This is incompatible with -c, since no per-process counts are kept.
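As a hedged sketch of -ff combined with -o, the following writes one trace file per process into a temporary directory. It assumes strace is installed and ptrace is permitted; the traced command and the file prefix trace are illustrative choices only.

```shell
#!/bin/sh
# Sketch: assumes strace is installed and ptrace is permitted.
dir=$(mktemp -d)
if command -v strace >/dev/null 2>&1 &&
   strace -ff -o "$dir/trace" -e trace=process \
       sh -c 'ls / >/dev/null; :' 2>/dev/null; then
    # One file per process: trace.<pid> for the shell and another for ls.
    ls "$dir"
    nfiles=$(ls "$dir" | wc -l)
else
    nfiles="unavailable"
fi
echo "trace files: $nfiles"
rm -rf "$dir"
```

The trailing `:` keeps ls from being the last command in the shell, so the shell creates a child process for it instead of replacing itself, which is what makes a second trace file appear.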