 

Why does my application spend 24% of its life doing a null check?

I've got a performance-critical binary decision tree, and I'd like to focus this question on a single line of code. The code for the binary-tree iterator is below, along with the results of running performance analysis against it.

        public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
        {
 0.2%       ScTreeNode node = RootNodes[rootIndex].TreeNode;
24.6%       while (node.BranchData != null)
            {
 0.2%           BranchNodeData b = node.BranchData;
 0.5%           node = b.Child2;
12.8%           if (inputs[b.SplitInputIndex] <= b.SplitValue)
 0.8%               node = b.Child1;
            }
 0.4%       return node;
        }

BranchData is a field, not a property. I did this to prevent the risk of it not being inlined.

The BranchNodeData class is as follows:

public sealed class BranchNodeData
{
    /// <summary>
    /// The index of the data item in the input array on which we need to split
    /// </summary>
    internal int SplitInputIndex = 0;

    /// <summary>
    /// The value that we should split on
    /// </summary>
    internal float SplitValue = 0;

    /// <summary>
    /// The node's children
    /// </summary>
    internal ScTreeNode Child1;
    internal ScTreeNode Child2;
}

As you can see, the while loop / null check is a massive hit on performance. The tree is massive, so I would expect searching for a leaf to take a while, but I'd like to understand the disproportionate amount of time spent on that one line.

I've tried:

  • Separating the null check from the while loop - it's the null check that's the hit (a sketch of this variant follows the list).
  • Adding a boolean field to the object and checking against that instead - it made no difference. It doesn't matter what's being compared; it's the comparison itself that's the issue.
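
A sketch of the separated variant, for reference (illustrative, not necessarily the exact code that was profiled):

while (true)
{
    BranchNodeData branch = node.BranchData;     // the expensive load/check now lands on this line
    if (branch == null)
        break;

    node = branch.Child2;
    if (inputs[branch.SplitInputIndex] <= branch.SplitValue)
        node = branch.Child1;
}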

Is this a branch prediction issue? If so, what can I do about it? If anything?

I won't pretend to understand the CIL, but I'll post it for anyone who does, so they can try to scrape some information from it.

.method public hidebysig instance class OptimalTreeSearch.ScTreeNode GetNodeForState (
    int32 rootIndex,
    float32[] inputs
) cil managed
{
    // Method begins at RVA 0x2dc8
    // Code size 67 (0x43)
    .maxstack 2
    .locals init (
        [0] class OptimalTreeSearch.ScTreeNode node,
        [1] class OptimalTreeSearch.BranchNodeData b
    )

    IL_0000: ldarg.0
    IL_0001: ldfld class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode> OptimalTreeSearch.ScSearchTree::RootNodes
    IL_0006: ldarg.1
    IL_0007: callvirt instance !0 class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode>::get_Item(int32)
    IL_000c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.ScRootNode::TreeNode
    IL_0011: stloc.0
    IL_0012: br.s IL_0039
    // loop start (head: IL_0039)
        IL_0014: ldloc.0
        IL_0015: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
        IL_001a: stloc.1
        IL_001b: ldloc.1
        IL_001c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child2
        IL_0021: stloc.0
        IL_0022: ldarg.2
        IL_0023: ldloc.1
        IL_0024: ldfld int32 OptimalTreeSearch.BranchNodeData::SplitInputIndex
        IL_0029: ldelem.r4
        IL_002a: ldloc.1
        IL_002b: ldfld float32 OptimalTreeSearch.BranchNodeData::SplitValue
        IL_0030: bgt.un.s IL_0039

        IL_0032: ldloc.1
        IL_0033: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child1
        IL_0038: stloc.0

        IL_0039: ldloc.0
        IL_003a: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
        IL_003f: brtrue.s IL_0014
    // end loop

    IL_0041: ldloc.0
    IL_0042: ret
} // end of method ScSearchTree::GetNodeForState

Edit: I decided to do a branch prediction test. I added an identical if inside the while, so we have

while (node.BranchData != null) 

and

if (node.BranchData != null) 

inside that. I then ran performance analysis against it, and the first comparison took six times longer to execute than the second comparison, which always returned true. So it looks like it is indeed a branch prediction issue - and I'm guessing there's nothing I can do about it?!
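
A sketch of the modified loop used for this test (an illustrative reconstruction, not the exact code):

while (node.BranchData != null)              // the expensive comparison
{
    if (node.BranchData != null)             // identical check, always true, roughly six times cheaper
    {
        BranchNodeData b = node.BranchData;
        node = b.Child2;
        if (inputs[b.SplitInputIndex] <= b.SplitValue)
            node = b.Child1;
    }
}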

Another Edit

The above result would also occur if node.BranchData had to be loaded from RAM for the while check - it would then be cached for the if statement.


This is my third question on a similar topic. This time I'm focusing on a single line of code. My other questions on this subject are:

  • Could I use a faster data structure than a tree for this?
  • Micro optimisations iterating through a tree in C#
Asked May 15 '13 by Will Calderwood


1 Answer

The tree is massive

By far the most expensive thing a processor ever does is not executing instructions; it is accessing memory. The execution core of a modern CPU is many times faster than the memory bus. It is a problem related to distance: the further an electrical signal has to travel, the harder it gets to deliver that signal to the other end of the wire without it being corrupted. The only cure for that problem is to make it go slower. That's a big problem with the wires that connect the CPU to the RAM in your machine; you can pop the case and see the wires.

Processors have a counter-measure for this problem: they use caches, buffers that store a copy of the bytes in RAM. An important one is the L1 cache, typically 16 kilobytes for data and 16 kilobytes for instructions. It is small, which allows it to be close to the execution engine. Reading bytes from the L1 cache typically takes 2 or 3 CPU cycles. Next up is the L2 cache, bigger and slower. Upscale processors also have an L3 cache, bigger and slower yet. As process technology improves, those buffers take less space and automatically become faster as they get closer to the core, a big reason why newer processors are better and how they manage to use an ever-increasing number of transistors.

Those caches are, however, not a perfect solution. The processor will still stall on a memory access if the data is not available in one of the caches. It cannot continue until the very slow memory bus has supplied the data. Losing a fat hundred CPU cycles on a single instruction is entirely possible.

Tree structures are a problem: they are not cache friendly. Their nodes tend to be scattered throughout the address space. The fastest way to access memory is by reading from sequential addresses. The unit of storage for the L1 cache is a 64-byte line, so once the processor has read one byte, the next 63 are very fast since they'll already be present in the cache.
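
A minimal sketch (not part of the original answer) that makes this measurable: summing a large array in order touches whole 64-byte cache lines, while visiting the same elements in a shuffled order forces a cache miss on almost every read and is typically several times slower.

using System;
using System.Diagnostics;

class CacheDemo
{
    static void Main()
    {
        const int n = 1 << 24;                  // 16M floats, far bigger than any cache
        var data = new float[n];
        var order = new int[n];
        var rng = new Random(42);
        for (int i = 0; i < n; i++) { data[i] = i; order[i] = i; }
        for (int i = n - 1; i > 0; i--)         // Fisher-Yates shuffle of the visit order
        {
            int j = rng.Next(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        var sw = Stopwatch.StartNew();
        double sequential = 0;
        for (int i = 0; i < n; i++) sequential += data[i];
        Console.WriteLine("sequential: " + sw.ElapsedMilliseconds + " ms, sum " + sequential);

        sw.Restart();
        double scattered = 0;
        for (int i = 0; i < n; i++) scattered += data[order[i]];
        Console.WriteLine("scattered:  " + sw.ElapsedMilliseconds + " ms, sum " + scattered);
    }
}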

That makes an array by far the most efficient data structure. It is also the reason that the .NET List<> class isn't a linked list at all; it uses an array for storage. The same goes for other collection types like Dictionary: it is structurally not remotely similar to an array, but it is internally implemented with arrays.

So your while() statement is very likely suffering from CPU stalls, because it dereferences a pointer to access the BranchData field. The next statement is very cheap because the while() statement already did the heavy lifting of retrieving the value from memory. Assigning the local variable is cheap; a processor uses a buffer for writes.

This is not otherwise a simple problem to solve; flattening your tree into arrays is very likely to be impractical, not least because you typically cannot predict in which order the nodes of the tree will be visited. A red-black tree might help; it isn't clear from the question. So a simple conclusion to draw is that it is already running as fast as you can hope for. And if you need it to go faster, then you'll need better hardware with a faster memory bus. DDR4 is going mainstream this year.
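
For completeness, a hedged sketch of what "flattening the tree into arrays" could mean in practice - the FlatBranch and FlatTree names are hypothetical, not from the question. Branch nodes become a struct array indexed by int, so a traversal walks one contiguous block of memory instead of chasing object references.

// Illustrative sketch only; whether it helps depends on the access pattern.
public struct FlatBranch
{
    public int SplitInputIndex;    // index into the inputs array to test
    public float SplitValue;       // threshold to split on
    public int Child1;             // branch index if >= 0, otherwise -(leafIndex + 1)
    public int Child2;             // same encoding as Child1
}

public sealed class FlatTree
{
    private readonly FlatBranch[] branches;    // all branch nodes in one contiguous array
    private readonly ScTreeNode[] leaves;      // leaf payloads

    public FlatTree(FlatBranch[] branches, ScTreeNode[] leaves)
    {
        this.branches = branches;
        this.leaves = leaves;
    }

    public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
    {
        int index = rootIndex;                 // rootIndex must refer to a branch node
        while (index >= 0)                     // negative values encode leaves
        {
            FlatBranch b = branches[index];
            index = inputs[b.SplitInputIndex] <= b.SplitValue ? b.Child1 : b.Child2;
        }
        return leaves[-index - 1];
    }
}

Whether this actually wins depends on whether the hot paths through the tree end up close together in the array, which is exactly the part that is hard to predict.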

Answered by Hans Passant