Thread Utilization profiling on linux

Tags: performance, linux, multithreading, profiling, perf

Linux perf-tools are great for finding hotspots in CPU cycles and optimizing those hotspots. But once some parts are parallelized it becomes difficult to spot the sequential parts since they take up significant wall time but not necessarily many CPU cycles (the parallel parts are already burning those).

To avoid the XY problem: my underlying motivation is to find sequential bottlenecks in multi-threaded code. The parallel phases can easily dominate the aggregate CPU-cycle statistics even though, per Amdahl's law, the sequential phases dominate wall time.
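As a rough illustration of this effect, the following sketch uses made-up numbers (a 2 s serial phase, 16 CPU-seconds of parallel work, 16 threads, ideal scaling). None of these values come from a real measurement.

```python
def phase_shares(serial_s: float, parallel_cpu_s: float, threads: int):
    """Return (serial share of CPU time, serial share of wall time)."""
    # The parallel work is spread across all threads, so its wall time shrinks
    # while its total CPU time stays the same (ideal scaling is assumed).
    parallel_wall_s = parallel_cpu_s / threads
    cpu_total = serial_s + parallel_cpu_s
    wall_total = serial_s + parallel_wall_s
    return serial_s / cpu_total, serial_s / wall_total


# Hypothetical workload: 2 s serial, 16 CPU-seconds parallel, 16 threads.
cpu_share, wall_share = phase_shares(serial_s=2.0, parallel_cpu_s=16.0, threads=16)
print(f"serial phase: {cpu_share:.0%} of CPU time, {wall_share:.0%} of wall time")
# prints "serial phase: 11% of CPU time, 67% of wall time"
```

A CPU-cycle profile would rank the serial phase at about a ninth of the samples, while it accounts for two thirds of the elapsed time.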

For Java applications this is fairly easy to achieve with VisualVM or YourKit, which have thread-utilization timelines.

[Image: YourKit thread timeline (https://i.stack.imgur.com/2L0QT.png)]

Note that it shows both thread state (runnable, waiting, blocked) and stack samples for selected ranges or points in time.

How do I achieve something comparable with perf or other native profilers on linux? It doesn't have to be a GUI visualization, just a way to find sequential bottlenecks and CPU samples associated with them.

See also, a more narrow followup question focusing on perf.


asked Jul 22 '17 05:07

the8472

2 Answers

Oracle's Developer Studio Performance Analyzer might do exactly what you're looking for. (Were you running on Solaris, I know it would do exactly what you're looking for, but I've never used it on Linux, and I don't have access right now to a Linux system suitable to try it on).

This is a screenshot of a multithreaded IO test program, running on an x86 Solaris 11 system:

[Image: Screenshot of multithreaded IO performance test program (https://i.stack.imgur.com/ICWMD.png)]

Note that you can see the call stack of every thread along with seeing exactly how the threads interact - in the posted example, you can see where the threads that actually perform the IO start, and you can see each of the threads as they perform.

This is a view that shows exactly where thread 2 is at the highlighted moment:

[Image: https://i.stack.imgur.com/hTQGM.png]

This view has the synchronization event view enabled, showing that thread 2 is stuck in a sem_wait() call for the highlighted period. Note the additional rows of graphical data, showing the synchronization events (sem_wait(), pthread_cond_wait(), pthread_mutex_lock(), etc.):

[Image: https://i.stack.imgur.com/dsNbD.png]

Other views include a call tree:

[Image: https://i.stack.imgur.com/4aWGB.png]

a thread overview (not very useful with only a handful of threads, but likely very useful if you have hundreds or more):

[Image: https://i.stack.imgur.com/PNKY0.png]

and a view showing function CPU utilization:

[Image: https://i.stack.imgur.com/f87r7.png]

And you can see how much time is spent on each line of code:

[Image: https://i.stack.imgur.com/OO9fn.png]

Unsurprisingly, a process that's writing a large file to test IO performance spent almost all its time in the write() function.

The full Oracle brief is at https://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf

Quick usage overview:

Collect performance data using the collect utility. See https://docs.oracle.com/cd/E77782_01/html/E77798/afadm.html#scrolltoc
Start the analyzer GUI to analyze the data collected above.


answered Oct 21 '22 17:10

Andrew Henle

You can get the result you want with Off-CPU Flame Graphs, a tool we use for off-CPU analysis. It is part of the Flame Graphs project.

I used the off-CPU analysis, which is described as follows:

"Off-CPU analysis is a performance methodology where off-CPU time is measured and studied, along with context such as stack traces. It differs from CPU profiling, which only examines threads if they are executing on-CPU."
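To get a feel for the idea without any tracing tools, here is a minimal sketch that polls the per-thread state letter in /proc/<pid>/task/<tid>/stat and flags time ranges where at most one thread is runnable. This is a crude polling stand-in, not how perf or the bcc tools measure off-CPU time (they trace scheduler events). The function names, the sampling interval and the threshold are all hypothetical, and state R means "running or runnable", so it only approximates on-CPU time.

```python
import os
import time
from collections import Counter


def parse_task_state(stat_line: str) -> str:
    """Extract the one-letter state from a /proc/<pid>/task/<tid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or ')', so split after the LAST ')' instead of splitting on whitespace.
    return stat_line.rsplit(")", 1)[1].split()[0]


def sample_thread_states(pid: int) -> dict:
    """Return {tid: state} for all threads of pid at this instant."""
    states = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                states[int(tid)] = parse_task_state(f.read())
        except (FileNotFoundError, ProcessLookupError):
            pass  # the thread exited between listdir() and open()
    return states


def utilization_timeline(pid: int, samples: int = 50, interval_s: float = 0.01):
    """Poll thread states; return a list of (timestamp, Counter of states)."""
    timeline = []
    for _ in range(samples):
        counts = Counter(sample_thread_states(pid).values())
        timeline.append((time.monotonic(), counts))
        time.sleep(interval_s)
    return timeline


def sequential_phases(timeline, max_running: int = 1):
    """Return (start, end) timestamps where at most max_running threads were in state R."""
    phases, start, prev_ts = [], None, None
    for ts, counts in timeline:
        if counts.get("R", 0) <= max_running:
            if start is None:
                start = ts  # a low-parallelism phase begins here
        elif start is not None:
            phases.append((start, prev_ts))
            start = None
        prev_ts = ts
    if start is not None:
        phases.append((start, prev_ts))
    return phases


if __name__ == "__main__" and os.path.isdir("/proc/self/task"):
    # Demonstration only: sample this very process a few times.
    for ts, counts in utilization_timeline(os.getpid(), samples=5):
        runnable = counts.get("R", 0)
        print(f"{ts:.3f} runnable={runnable} off-cpu={sum(counts.values()) - runnable}")
```

The intervals returned by sequential_phases() are candidates for sequential bottlenecks. A stack sampler restricted to those time ranges would then show what the single busy thread was doing.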

The Off-CPU Flame Graph tooling builds on the tools you mentioned as preferred (perf, the bcc tools), but it produces easy-to-read output called a flame graph: an interactive SVG file that looks like this (SVG Off-CPU Time Flame Graph):

[Image: https://i.stack.imgur.com/9mEyA.png]

"The width is proportional to the total time in the code paths, so look for the widest towers first to understand the biggest sources of latency. The left-to-right ordering has no meaning, and the y-axis is the stack depth."
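How the width of a frame comes about can be sketched in a few lines. The example assumes input in the "folded" stack format (semicolon-separated frames, root first, followed by a space and a value). The sample stacks and their microsecond values are invented for illustration, and frame_widths is a hypothetical helper, not part of the Flame Graphs scripts.

```python
from collections import defaultdict


def frame_widths(folded_lines):
    """Sum the value of every stack that passes through each call-path prefix."""
    widths = defaultdict(int)
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # Folded format: "frame1;frame2;...;frameN <value>", root frame first.
        stack, value = line.rsplit(" ", 1)
        frames = stack.split(";")
        # A frame's width is the total of all stacks that contain its call path.
        for depth in range(1, len(frames) + 1):
            widths[tuple(frames[:depth])] += int(value)
    return widths


# Hypothetical off-CPU samples, values in microseconds.
sample = [
    "main;worker;pthread_cond_wait 900",
    "main;worker;read 300",
    "main;coordinator;sem_wait 4000",
]
widths = frame_widths(sample)
for path, total in sorted(widths.items(), key=lambda kv: -kv[1]):
    print(f"{total:>6} us  {';'.join(path)}")
```

In this made-up data the coordinator path blocked in sem_wait is the widest tower below main, which is the kind of pattern to look for first.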

Two more analyses that are part of Off-CPU Flame Graphs may also help you, although I have not tried them personally.

Wakeup:

"This lets us solve more problems than off-CPU tracing alone, as the wakeup information can explain the real reason for blocking."

And Chain Graph:

"Chain graphs are an experimental visualization that associates off-CPU stacks with their wakeup stacks."

There is also an experimental visualization that combines CPU and off-CPU flame graphs, Hot/Cold Flame Graphs:

"This shows all thread time in one graph, and allows direct comparisons between on- and off-CPU code path durations."

It takes a little time to read about this profiling tool and understand its concepts, but using it is easy, and its output is easier to analyze than that of the other tools mentioned above.

Good Luck!

answered Oct 21 '22 15:10

Gal S
