Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What does perf's option to measure events at user and kernel levels mean?

The Linux perf tool provides access to CPU event counters. It lets you specify the events to be counted and when to count those events.

https://perf.wiki.kernel.org/index.php/Tutorial

By default, events are measured at both user and kernel levels:

perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000

To measure only at the user level, it is necessary to pass a modifier:

perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000

To measure both user and kernel (explicitly):

perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000

From this, I expected that cycles:u meant "only count events while running non-kernel code" and recorded counts would not map to kernel symbols but that doesn't seem to be the case.

Here's an example:

perf record -e cycles:u du -sh ~
[...]
perf report --stdio -i perf.data
[...]
9.24%  du       [kernel.kallsyms]  [k] system_call
[...]
0.70%  du       [kernel.kallsyms]  [k] page_fault
[...]

If I do the same but use cycles:uk then I do get more kernel symbols reported so the event modifiers do have an effect. Using cycles:k produces reports with almost exclusively kernel symbols but it does include a few libc symbols.

What's going on here? Is this the expected behavior? Am I misunderstanding the language used in the linked document?

The linked document also includes this table which uses slightly different descriptions if that helps:

Modifiers | Description                          | Example
----------+--------------------------------------+----------
u         | monitor at priv level 3, 2, 1 (user) | event:u
k         | monitor at priv level 0 (kernel)     | event:k

Edit: more info:

CPU is an Intel Haswell. The specific model is an i7-5820K.

Distro is up to date Arch Linux (rolling release schedule) with kernel 4.1.6.

The version of perf itself is 4.2.0.

Edit2:

More output from example runs. As you can see, cycles:u mostly reports non-kernel symbols. I know that perf sometimes mis-attributes counts to a neighboring instruction when you look at the annotated assembly output. Maybe this is related?

cycles:u

# perf record -e cycles:u du -sh ~
179G    /home/khouli
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.116 MB perf.data (2755 samples) ]
# sudo perf report --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 2K of event 'cycles:u'
# Event count (approx.): 661835375
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ..............................
#
    11.02%  du       libc-2.22.so       [.] _int_malloc
     9.73%  du       libc-2.22.so       [.] _int_free
     9.24%  du       du                 [.] fts_read
     9.23%  du       [kernel.kallsyms]  [k] system_call
     4.17%  du       libc-2.22.so       [.] strlen
     4.17%  du       libc-2.22.so       [.] __memmove_sse2
     3.47%  du       libc-2.22.so       [.] __readdir64
     3.33%  du       libc-2.22.so       [.] malloc_consolidate
     2.87%  du       libc-2.22.so       [.] malloc
     1.83%  du       libc-2.22.so       [.] msort_with_tmp.part.0
     1.63%  du       libc-2.22.so       [.] __memcpy_avx_unaligned
     1.63%  du       libc-2.22.so       [.] __getdents64
     1.52%  du       libc-2.22.so       [.] free
     1.47%  du       libc-2.22.so       [.] __memmove_avx_unaligned
     1.44%  du       du                 [.] 0x000000000000e609
     1.41%  du       libc-2.22.so       [.] _wordcopy_bwd_dest_aligned
     1.19%  du       du                 [.] 0x000000000000e644
     0.93%  du       libc-2.22.so       [.] __fxstatat64
     0.85%  du       libc-2.22.so       [.] do_fcntl
     0.73%  du       [kernel.kallsyms]  [k] page_fault
[lots more symbols, almost all in du...]

cycles:uk

# perf record -e cycles:uk du -sh ~
179G    /home/khouli
[ perf record: Woken up 1 times to write data ]
[ext4] with build id 0f47443e26a238299e8a5963737da23dd3530376 not found,
continuing without symbols
[ perf record: Captured and wrote 0.120 MB perf.data (2856 samples) ]
# perf report --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 2K of event 'cycles:uk'
# Event count (approx.): 3118065867
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ..............................................
#
    13.80%  du       [kernel.kallsyms]  [k] __d_lookup_rcu
     6.16%  du       [kernel.kallsyms]  [k] security_inode_getattr
     2.52%  du       [kernel.kallsyms]  [k] str2hashbuf_signed
     2.43%  du       [kernel.kallsyms]  [k] system_call
     2.35%  du       [kernel.kallsyms]  [k] half_md4_transform
     2.31%  du       [kernel.kallsyms]  [k] ext4_htree_store_dirent
     1.97%  du       [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     1.96%  du       libc-2.22.so       [.] _int_malloc
     1.93%  du       du                 [.] fts_read
     1.90%  du       [kernel.kallsyms]  [k] system_call_after_swapgs
     1.83%  du       libc-2.22.so       [.] _int_free
     1.44%  du       [kernel.kallsyms]  [k] link_path_walk
     1.33%  du       libc-2.22.so       [.] __memmove_sse2
     1.19%  du       [kernel.kallsyms]  [k] _raw_spin_lock
     1.19%  du       [kernel.kallsyms]  [k] __fget_light
     1.12%  du       [kernel.kallsyms]  [k] kmem_cache_alloc
     1.12%  du       [kernel.kallsyms]  [k] __ext4_check_dir_entry
     1.05%  du       [kernel.kallsyms]  [k] lockref_get_not_dead
     1.02%  du       [kernel.kallsyms]  [k] generic_fillattr
     0.95%  du       [kernel.kallsyms]  [k] do_dentry_open
     0.95%  du       [kernel.kallsyms]  [k] path_init
     0.95%  du       [kernel.kallsyms]  [k] lockref_put_return
     0.91%  du       libc-2.22.so       [.] do_fcntl
     0.91%  du       [kernel.kallsyms]  [k] ext4_getattr
     0.91%  du       [kernel.kallsyms]  [k] rb_insert_color
     0.88%  du       [kernel.kallsyms]  [k] __kmalloc
     0.88%  du       libc-2.22.so       [.] __readdir64
     0.88%  du       libc-2.22.so       [.] malloc
     0.84%  du       [kernel.kallsyms]  [k] ext4fs_dirhash
     0.84%  du       [kernel.kallsyms]  [k] __slab_free
     0.84%  du       [kernel.kallsyms]  [k] in_group_p
     0.81%  du       [kernel.kallsyms]  [k] get_empty_filp
     0.77%  du       libc-2.22.so       [.] malloc_consolidate
[more...]

cycles:k

# perf record -e cycles:k du -sh ~
179G    /home/khouli
[ perf record: Woken up 1 times to write data ]
[ext4] with build id 0f47443e26a238299e8a5963737da23dd3530376 not found, continuing
without symbols
[ perf record: Captured and wrote 0.118 MB perf.data (2816 samples) ]
# perf report --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 2K of event 'cycles:k'
# Event count (approx.): 2438426748
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ..............................................
#
    17.11%  du       [kernel.kallsyms]  [k] __d_lookup_rcu
     6.97%  du       [kernel.kallsyms]  [k] security_inode_getattr
     4.22%  du       [kernel.kallsyms]  [k] half_md4_transform
     3.10%  du       [kernel.kallsyms]  [k] str2hashbuf_signed
     3.01%  du       [kernel.kallsyms]  [k] system_call_after_swapgs
     2.59%  du       [kernel.kallsyms]  [k] ext4_htree_store_dirent
     2.24%  du       [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     2.14%  du       [kernel.kallsyms]  [k] lockref_get_not_dead
     1.86%  du       [kernel.kallsyms]  [k] ext4_getattr
     1.85%  du       [kernel.kallsyms]  [k] kfree
     1.68%  du       [kernel.kallsyms]  [k] __ext4_check_dir_entry
     1.53%  du       [kernel.kallsyms]  [k] __fget_light
     1.34%  du       [kernel.kallsyms]  [k] link_path_walk
     1.34%  du       [kernel.kallsyms]  [k] path_init
     1.22%  du       [kernel.kallsyms]  [k] __kmalloc
     1.22%  du       [kernel.kallsyms]  [k] kmem_cache_alloc
     1.14%  du       [kernel.kallsyms]  [k] do_dentry_open
     1.11%  du       [kernel.kallsyms]  [k] ext4_readdir
     1.07%  du       [kernel.kallsyms]  [k] __find_get_block_slow
     1.07%  du       libc-2.22.so       [.] do_fcntl
     1.04%  du       [kernel.kallsyms]  [k] _raw_spin_lock
     0.99%  du       [kernel.kallsyms]  [k] _raw_read_lock
     0.95%  du       libc-2.22.so       [.] __fxstatat64
     0.94%  du       [kernel.kallsyms]  [k] rb_insert_color
     0.94%  du       [kernel.kallsyms]  [k] generic_fillattr
     0.93%  du       [kernel.kallsyms]  [k] ext4fs_dirhash
     0.93%  du       [kernel.kallsyms]  [k] find_get_entry
     0.89%  du       [kernel.kallsyms]  [k] rb_next
     0.89%  du       [kernel.kallsyms]  [k] is_dx_dir
     0.89%  du       [kernel.kallsyms]  [k] in_group_p
     0.89%  du       [kernel.kallsyms]  [k] cp_new_stat
   [more...]

perf_event_paranoid

$ cat /proc/sys/kernel/perf_event_paranoid
1

kernel config for perf

$ cat /proc/config.gz | gunzip | grep -A70 'Kernel Perf'
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SYSTEM_TRUSTED_KEYRING=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=m
# CONFIG_OPROFILE_EVENT_MULTIPLEX is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
# CONFIG_HAVE_64BIT_ALIGNED_ACCESS is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_CLK=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
CONFIG_CC_STACKPROTECTOR_STRONG=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
like image 838
Praxeolitic Avatar asked Oct 12 '15 17:10

Praxeolitic


People also ask

What is perf used for?

Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple commandline interface. Perf is based on the perf_events interface exported by recent versions of the Linux kernel.

What are perf events?

perf_events provides a command line tool, perf, and subcommands for various profiling activities. This is a single interface for the different instrumentation frameworks that provide the various events. The perf command alone will list the subcommands; here is perf version 4.10 (for the Linux 4.10 kernel):

What is perf kernel?

perf (sometimes called perf_events or perf tools, originally Performance Counters for Linux, PCL) is a performance analyzing tool in Linux, available from Linux kernel version 2.6. 31 in 2009.

How does perf work in Linux?

Perf works on the Model Specific Registers of your CPU for measurements like cycles or branch-misses or so. A special Part called PMU(Performance Measurement Unit) is counting all kinds of events.


1 Answers

I understand your question to be: Why does perf for user mode recording show values from inside the kernel? Well, it's doing exactly what it's supposed to do, from a "system accounting" standpoint.

You did: perf record -e cycles:u du -sh ~ and you got stats on system_call and page_fault and you're wondering why that happened?

When you did the du, it had to traverse the file system. In doing so, it issued system calls for things it needed (e.g. open, readdir, etc.). du initiated these things for you, so it got "charged back" for them. Likewise, du page faulted a number of times.

perf is keeping track of any activity caused by a given process/program, even if it happens inside kernel address space. In other words, the program requested the activity, and the kernel performed it at the program's behest, so it gets charged appropriately. The kernel had to do "real work" to do FS I/O and/or resolve page faults, so you must "pay for the work you commissioned". Anything that a given program does that consumes system resources gets accounted for.

This is the standard accounting model for computer systems, dating back to the 1960's when people actually rented out time on mainframe computers. You got charged for everything you did [just like a lawyer :-)], directly or indirectly.
* charge per minute of connect time
* charge per cpu cycle consumed in user program
* charge per cpu cycle executed for program in kernel space
* charge for each network packet sent/received
* charge for any page fault caused by your program
* charge for each disk block read/written, either to a file or to the paging/swap disk
At the end of the month, they mailed you an itemized bill [just like a utility bill], and you had to pay:
Real money.

Note that there are some things that will not be charged for. For example, let's assume your program is compute bound but does not do [much] I/O and uses a relatively small amount of memory (i.e. it does not cause a page fault on its own). The program will get charged for user space CPU usage.

The OS may have to swap out (i.e. steal) one or more of your pages to make room for some other memory hog program. After the hog runs, your program will run again. Your program will need to fault back in the page or pages that were stolen from it.

Your program will not be charged for these because your program did not cause the page fault. In other words, for every page "stolen" from you, you're given a "credit" for that page when your program has to fault it back in.

Also, when trying to run a different process, the kernel does not charge the CPU time consumed by its process scheduler to any process. This is considered "overhead" and/or standard operating costs. For example, if you have a checking account with a bank, they don't charge you for the upkeep costs on the local branch office that you visit.

So perf, while useful to measure performance, it is using an accounting model to get the data.

It's like a car. You can drive your car to the store and pick up something, and you will consume some gasoline. Or, you can ask a friend to drive your car to the store. In either case, you have to pay for the gasoline because you drove the car, or because [when the friend drove the car] the gasoline was consumed when doing something for you. In this case, the kernel is your friend :-)

UPDATE:

My source for this is the source [kernel source]. And, I've been doing kernel programming for 40 years.

There are two basic types of perf counters. "Macro" ones such as page fault, that the kernel can generate. Others are the syscall counter.

The other time is the "micro" or "nano" type. These come from x86's PMC arch, and have counters for things like "cache miss", "branch mispredict", "data fetch mispredict", etc. that the kernel can't compute.

The PMC counters just free run. That's why you get your global stats, regardless of what recording mode you're doing. The kernel can interrogate them periodically, but it can't get control every time a PMC is incremented. Want the global/system-wide and/or per-CPU values for these? Just execute the appropriate RDPMC instruction.

To keep track of PMC for a process, when you start a process, do RDPMC and save the value in the task struct [for as many that are marked "of interest"] as "PMC value at start". When the given CPU core is rescheduled, the scheduler computes the "next" task, the scheduler gets the current PMC value, takes the difference between it and one it stored in the "old" task block when it started that task, and bumps up that task's "total count" for that PMC. The "current value" becomes the new task's "PMC value at start"

In Linux, when a task/context switch occurs, it generates two perf events, one for "entering new task on cpu X" and "stopping old task on cpu X".

Your question was why monitoring for "user mode" produced kernel addresses. That's because when recording (and this is not the perf program), it stores the temp data [as mentioned above] in the current task/context block, until a task switch actually occurs.

The key thing to note is that this context does not change simply because a syscall was executed--only when a context switch occurs. For example, the gettimeofday syscall justs gets wall clock time and returns it to user space. It does not do a context switch, so any perf event that it kicks off will be charged to active/current context. It doesn't matter whether it comes from kernel space or user space.

As a further example, suppose the process does a file read syscall. In traversing the file handle data, inode, etc. it may generate several perf events. Also, it will probably generate a few more cache misses and other PMC counter bumps. If the desired block is already in the FS block cache, the syscall will just do a copy_to_user and then reenter user space. No expensive context switch with the above PMC difference calculations as the pmc_value_at_start is still valid.

One of the reasons that it's done this way is performance [of the perf mechanism]. If you did the PMC save/restore immediately upon crossing to kernel space after a syscall starts [to separate kernel stats from user stats for a given process, as you'd like], the overhead would be enormous. You wouldn't be performance measuring the base kernel. You'd be performance measuring the kernel + a lot of perf overhead.

When I had to do performance analysis of a commercial hard realtime system based on Linux, I developed my own performance logging system. The system had 8 CPU cores, interacting with multiple custom hardware boards on the PCIe bus with multiple FPGAs. The FPGAs also had custom firmware running inside a Microblaze. Event logs from user space, kernel space, and microblaze could all be time coordinated to nanosecond resolution and the time to store an event record was 70ns.

To me, Linux's perf mechanism is a bit crude and bloated. If one were to use it to try and troubleshoot a performance/timing bug that involved race conditions, possible lock/unlock problems, etc. it might be problematic. That is, running the system without perf and you get the bug. Turn on perf, and you don't because you've changed the fundamental characteristic timing of the system. Turn perf off, and timing bug reappears.

like image 138
Craig Estey Avatar answered Oct 30 '22 19:10

Craig Estey