What is the overhead of using Intel Last Branch Record?

Question

Last Branch Record (LBR) refers to a collection of register pairs (MSRs) that store the source and destination addresses of recently executed branches. The document at http://css.csail.mit.edu/6.858/2012/readings/ia32/ia32-3b.pdf has more information in case you are interested.

a) Can someone give an idea of how much LBR slows down the execution of common programs, both CPU-intensive and I/O-intensive?
b) Will branch prediction be turned OFF when LBR tracing is ON?

osgx · Accepted Answer

The paper Intel Code Execution Trace Resources (by Craig Pedersen and Jeff Acampora of Arium, Apr 29, 2012) lists three variants of branch tracing:

Last Branch Record (LBR) flag in the DebugCtlMSR and corresponding LastBranchToIP and LastBranchFromIP MSRs as well as LastExceptionToIP and LastExceptionFromIP MSRs.
Branch Trace Store (BTS) using either cache-as-RAM or system DRAM.
Architecture Event Trace (AET) captured off the XDP port and stored externally in a connected In-Target Probe.
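
The first two variants are switched on through flag bits in the debug control MSR. As a minimal sketch, assuming the bit positions that the Intel SDM documents for IA32_DEBUGCTL (MSR 0x1D9: bit 0 LBR, bit 1 BTF, bit 6 TR, bit 7 BTS, bit 8 BTINT), a raw register value could be decoded as below. The helper name `decode_debugctl` is hypothetical, and actually reading the MSR needs ring-0 access (a driver, or something like /dev/cpu/N/msr on Linux), which is not shown here.

```python
# Assumed bit layout of IA32_DEBUGCTL, taken from the Intel SDM; verify for your CPU.
IA32_DEBUGCTL = 0x1D9
DEBUGCTL_BITS = {
    "LBR": 0,    # record last branches in the LBR MSRs
    "BTF": 1,    # single-step on branches instead of on every instruction
    "TR": 6,     # enable branch trace messages
    "BTS": 7,    # store branch trace messages in the BTS buffer
    "BTINT": 8,  # raise an interrupt when the BTS buffer is nearly full
}


def decode_debugctl(value):
    """Return a dict mapping each known flag name to True/False (hypothetical helper)."""
    return {name: bool((value >> bit) & 1) for name, bit in DEBUGCTL_BITS.items()}


if __name__ == "__main__":
    # 0xC1 sets bits 0, 6 and 7: LBR plus BTS-style tracing.
    print(decode_debugctl(0xC1))
```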

As stated on page 2, LBR saves its information in MSRs and "does not impede any real-time performance," but it is useful only for very short stretches of code ("effective trace display is very shallow and typically may only show hundreds of instructions."). It only keeps information about the last 4-16 branches.
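
To make the "very shallow" point concrete, here is a small software model of the LBR stack as a fixed-depth ring buffer. This is only an illustration of the behaviour described above, not how the hardware is implemented; the class name `LbrStackModel` and the addresses are made up for the example.

```python
from collections import deque


class LbrStackModel:
    """Toy model: keeps only the most recent `depth` (from, to) branch pairs."""

    def __init__(self, depth=16):
        # A deque with maxlen silently drops the oldest entry when full,
        # which mirrors how old branch records are overwritten.
        self.entries = deque(maxlen=depth)

    def record_branch(self, from_ip, to_ip):
        self.entries.append((from_ip, to_ip))

    def snapshot(self):
        """Return the recorded branches, newest first."""
        return list(reversed(self.entries))


if __name__ == "__main__":
    lbr = LbrStackModel(depth=4)
    for i in range(10):  # execute 10 branches; only the last 4 survive
        lbr.record_branch(0x1000 + i, 0x2000 + i)
    for from_ip, to_ip in lbr.snapshot():
        print(hex(from_ip), "->", hex(to_ip))
```

With a depth of 4 and ten recorded branches, the six oldest pairs are gone, which is why LBR alone cannot show more than the last few hundred instructions.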

BTS can capture many pairs of branch "From"s and "To"s, and stores them either in cache (Cache-as-RAM, CAR) or in system DRAM. With CAR, the trace depth/length is limited by the cache size (and some constant); with DRAM, the trace length is almost unlimited. The paper estimates the overhead of BTS at 20 to 100 percent, due to the additional memory stores.

BTS on Linux is easy to use with the proposed perf branch record (not yet in vanilla) or the btrax project. The perf branch presentation gives some hints about how BTS is organised: there is a BTS buffer whose records contain "from" and "to" fields and a "predicted" flag. So, branch prediction is not turned off when using BTS. Also, when the BTS buffer fills up to its maximum size, an interrupt is generated. The BTS-handling code in the kernel (the perf_events subsystem or the btrax kernel module) should copy the data from the BTS buffer to another location when that interrupt occurs.

So, in BTS mode there are two sources of overhead: Cache/Memory stores and interrupts from BTS buffer overflow.
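
A rough sketch of both costs, under stated assumptions: each BTS record is taken to be three 64-bit fields ("from", "to", flags, 24 bytes in total) with the "predicted" flag at bit 4 of the third field, which matches my reading of the Intel SDM for 64-bit mode but should be checked for your CPU. The function names are hypothetical, and the interrupt model is simplified to "one interrupt each time the buffer fills completely".

```python
import struct

# Assumed 64-bit BTS record layout: from, to, flags (little-endian, 24 bytes).
BTS_RECORD = struct.Struct("<QQQ")
PREDICTED_BIT = 1 << 4  # assumed position of the "branch predicted" flag


def parse_bts_buffer(raw):
    """Decode a raw BTS buffer into (from_ip, to_ip, predicted) tuples."""
    usable = len(raw) - len(raw) % BTS_RECORD.size  # ignore a trailing partial record
    records = []
    for offset in range(0, usable, BTS_RECORD.size):
        from_ip, to_ip, flags = BTS_RECORD.unpack_from(raw, offset)
        records.append((from_ip, to_ip, bool(flags & PREDICTED_BIT)))
    return records


def overflow_interrupts(num_branches, buffer_bytes):
    """Estimate how many buffer-full interrupts a run of num_branches causes."""
    records_per_buffer = buffer_bytes // BTS_RECORD.size
    if records_per_buffer == 0:
        raise ValueError("buffer too small for a single BTS record")
    return num_branches // records_per_buffer


if __name__ == "__main__":
    raw = BTS_RECORD.pack(0x400000, 0x400010, PREDICTED_BIT)
    print(parse_bts_buffer(raw))
    # One million branches with a buffer holding 1000 records.
    print(overflow_interrupts(1_000_000, 24 * 1000))
```

Every traced branch costs one 24-byte store, and every full buffer costs one interrupt plus a copy, so a larger buffer trades memory for fewer interrupts.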

AET uses an external agent to save debug and trace data. This agent is connected via the eXtended Debug Port (XDP) and interfaces with an In-Target Probe (ITP). According to the paper, the overhead of AET "can have a significant effect on system performance, which can be several orders of magnitude greater", because AET can generate and capture more types of events. However, the collected data is stored outside the platform being debugged.

The paper's "Summary" says:

LBR has no overhead, but is very shallow (4–16 branch locations, depending on the CPU). Trace data is available immediately out of reset.

BTS is much deeper, but has an impact on CPU performance and requires on-board RAM. Trace data is available as soon as CAR is initialized.

AET requires special ITP hardware and is not available on all CPU architectures. It has the advantage of storing the trace data off board.

Sirmabus · Answer

This is an old question (with an old answer too) but it does come up in searches today.

In 2021 what you want to use for hardware tracing is Intel® Processor Trace (IPT).
Keep in mind the question is obviously about Intel/AMD desktop CPUs. AFAIK there are similar solutions for ARM CPUs, not covered here.

I've used both LBR and IPT setups in Windows using custom drivers, and the latter has by far the least overhead: a slowdown of somewhere in the double digits or less, percentage-wise, when tracing a process.

Also, the statement in the other answer:

LBR has no overhead,..

is technically true, but impractical to rely on, because the overhead comes when actually reading the registers that store the records. Typically you will set it up to interrupt on every branch record. So we are talking about the overhead of handling an interrupt/exception/trap for every single branch instruction (call, jmp, jcc, int, etc.) executed by a thread that has tracing active via the trap/single-step flag.
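
As a back-of-the-envelope illustration of why that matters, here is a tiny cost model. All the numbers are hypothetical placeholders (they are not measurements), and the function name `single_step_slowdown` is made up; the only point is that a per-branch trap cost in the thousands of cycles dominates everything else.

```python
def single_step_slowdown(base_cycles, branches, trap_cycles):
    """Slowdown factor when every branch pays an extra trap-handling cost.

    base_cycles: cycles the program needs without tracing
    branches:    number of branch instructions executed
    trap_cycles: assumed cost of one interrupt/trap round trip, in cycles
    """
    if base_cycles <= 0:
        raise ValueError("base_cycles must be positive")
    return (base_cycles + branches * trap_cycles) / base_cycles


if __name__ == "__main__":
    # Hypothetical workload: 1e9 cycles, 2e8 branches, 2000 cycles per trap.
    print(single_step_slowdown(1_000_000_000, 200_000_000, 2000))
```

With placeholder inputs like these the slowdown comes out in the hundreds, compared with a factor of 1.0 when no trap is taken at all.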

The biggest downside to IPT is that it is available only on Intel CPUs, while the LBR feature is supported by AMD CPUs too.

Also, unfortunately, AFAIK (last time I checked) the IPT feature is not supported by any commercial VM software yet, which means you will more than likely only be able to run an IPT session directly on hardware. Not a big deal unless you really wanted to do your tracing in a VM. For that matter, LBR might have the same limitation.

Some Linux kernels have native support for IPT. A good starting point for Windows is Alex Ionescu's WinIPT project:
https://ionescu007.github.io/winipt/
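
On the Linux side, a quick way to see whether the running kernel exposes IPT is to look for the `intel_pt` PMU that perf registers in sysfs. A minimal sketch, assuming the usual perf event-source path; the function name `intel_pt_available` is hypothetical:

```python
import os

# Path where the perf subsystem normally lists the Intel PT PMU (assumption:
# a kernel built with Intel PT support, running on a CPU that has the feature).
INTEL_PT_SYSFS = "/sys/bus/event_source/devices/intel_pt"


def intel_pt_available(sysfs_path=INTEL_PT_SYSFS):
    """Return True if the intel_pt event source directory exists."""
    return os.path.isdir(sysfs_path)


if __name__ == "__main__":
    if intel_pt_available():
        print("intel_pt PMU found; perf can use the intel_pt event")
    else:
        print("no intel_pt PMU (unsupported CPU, VM, or kernel without support)")
```

If the directory is there, something like `perf record -e intel_pt//u ./your_program` should be able to collect a trace; inside most VMs I would expect the check to fail, in line with the limitation mentioned above.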

Tags:

branch-prediction

x86

trace

intel

intel-pmu

user655617
