A very simple sample application (.NET 4.6.2) produces a StackOverflowException at a recursion depth of 12737, which reduces to a recursion depth of 10243, if the most inner function call throws an exception, which is expected and OK.
If i use a Lazy<T>
to shortly hold an intermediate result, the StackOverflowException already occurs at a recursion depth of 2207, if no exception is thrown and at a recursion depth of 105, if an exception is thrown.
Note: The StackOverflowException at a depth of 105 is only observable if compiled to x64. With x86 (32-Bit), the effect first occurs at a depth of 4272. Mono (like used at https://repl.it) works without problems up to a depth of 74200.
The StackOverflowException does not occur within the deep recursion, but while ascending back to the main routine. The finally block is processed up some depth, then the program dies:
Exception System.InvalidOperationException at 105
Finally at 105
...
Exception System.InvalidOperationException at 55
Finally at 55
Exception System.InvalidOperationException at 54
Finally at 54
Process is terminated due to StackOverflowException.
or within the debugger:
The program '[xxxxx] Test.vshost.exe' has exited with code -2147023895 (0x800703e9).
Who might be able to explain this?
public class Program
{
private class Test
{
private int maxDepth;
private int CalculateWithLazy(int depth)
{
try
{
var lazy = new Lazy<int>(() => this.Calculate(depth));
return lazy.Value;
}
catch (Exception e)
{
Console.WriteLine("Exception " + e.GetType() + " at " + depth);
throw;
}
finally
{
Console.WriteLine("Finally at " + depth);
}
}
private int Calculate(int depth)
{
if (depth >= this.maxDepth) throw new InvalidOperationException("Max. recursion depth reached.");
return this.CalculateWithLazy(depth + 1);
}
public void Run()
{
for (int i = 1; i < 100000; i++)
{
this.maxDepth = i;
try
{
Console.WriteLine("MaxDepth: " + i);
this.CalculateWithLazy(0);
}
catch { /* ignore */ }
}
}
}
public static void Main(string[] args)
{
var test = new Test();
test.Run();
Console.Read();
}
Update: The problem can be reproduced without the usage of Lazy<T>
, just by having a try-catch-throw block in the recursive method.
[MethodImpl(MethodImplOptions.NoInlining)]
private int Calculate(int depth)
{
try
{
if (depth >= this.maxDepth) throw new InvalidOperationException("Max. recursion depth reached.");
return this.Calculate2(depth + 1);
}
catch
{
throw;
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private int Calculate2(int depth) // just to prevent the compiler from tail-recursion-optimization
{
return this.Calculate(depth);
}
public void Run()
{
for (int i = 1; i < 100000; i++)
{
this.maxDepth = i;
try
{
Console.WriteLine("MaxDepth: " + i);
this.Calculate(0);
}
catch(Exception e)
{
Console.WriteLine("Finished with " + e.GetType());
}
}
}
The problem can be reproduced without the usage of
Lazy<T>
, just by having a try-catch-throw block in the recursive method.
You've already noticed the source of the behavior. Now let me explain why, as this makes no sense, right?
It makes no sense, because the exception is catched and then immediately rethrown, so the stack should shrink, right?
The following test:
internal class Program
{
private int _maxDepth;
[MethodImpl(MethodImplOptions.NoInlining)]
private int Calculate(int depth)
{
try
{
Console.WriteLine("In try at depth {0}: stack frame count = {1}", depth, new StackTrace().FrameCount);
if (depth >= _maxDepth)
throw new InvalidOperationException("Max. recursion depth reached.");
return Calculate2(depth + 1);
}
catch
{
Console.WriteLine("In catch at depth {0}: stack frame count = {1}", depth, new StackTrace().FrameCount);
throw;
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private int Calculate2(int depth) => Calculate(depth);
public void Run()
{
try
{
_maxDepth = 10;
Calculate(0);
}
catch (Exception e)
{
Console.WriteLine("Finished with " + e.GetType());
}
}
public static void Main() => new Program().Run();
}
Seems to validate that hypothesis:
In try at depth 0: stack frame count = 3
In try at depth 1: stack frame count = 5
In try at depth 2: stack frame count = 7
In try at depth 3: stack frame count = 9
In try at depth 4: stack frame count = 11
In try at depth 5: stack frame count = 13
In try at depth 6: stack frame count = 15
In try at depth 7: stack frame count = 17
In try at depth 8: stack frame count = 19
In try at depth 9: stack frame count = 21
In try at depth 10: stack frame count = 23
In catch at depth 10: stack frame count = 23
In catch at depth 9: stack frame count = 21
In catch at depth 8: stack frame count = 19
In catch at depth 7: stack frame count = 17
In catch at depth 6: stack frame count = 15
In catch at depth 5: stack frame count = 13
In catch at depth 4: stack frame count = 11
In catch at depth 3: stack frame count = 9
In catch at depth 2: stack frame count = 7
In catch at depth 1: stack frame count = 5
In catch at depth 0: stack frame count = 3
Finished with System.InvalidOperationException
Well... no. It's not that simple.
.NET exceptions are built on top of Windows Structured Exception Handling (SEH), which is a complicated beast.
There are two articles you need to read if you want to know the details. They're old, but the parts relevant to your question are still accurate:
But let's concentrate on the matter at hand, which is stack unwinding.
Here's what the first one says what happens when the stack is unwound (emphasis mine):
The other form of unwind is the actual popping of the CPU stack. This doesn’t happen as eagerly as the popping of the SEH records. On X86, EBP is used as the frame pointer for methods containing SEH. ESP points to the top of the stack, as always. Until the stack is actually unwound, all the handlers are executed on top of the faulting exception frame. So the stack actually grows when a handler is called for the first or second pass. EBP is set to the frame of the method containing a filter or finally clause so that local variables of that method will be in scope.
The actual popping of the stack doesn’t occur until the catching ‘except’ clause is executed.
Let's modify our previous test program to check this:
internal class Program
{
private int _maxDepth;
private UIntPtr _stackStart;
[MethodImpl(MethodImplOptions.NoInlining)]
private int Calculate(int depth)
{
try
{
Console.WriteLine("In try at depth {0}: stack frame count = {1}, stack offset = {2}",depth, new StackTrace().FrameCount, GetLooseStackOffset());
if (depth >= _maxDepth)
throw new InvalidOperationException("Max. recursion depth reached.");
return Calculate2(depth + 1);
}
catch
{
Console.WriteLine("In catch at depth {0}: stack frame count = {1}, stack offset = {2}", depth, new StackTrace().FrameCount, GetLooseStackOffset());
throw;
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private int Calculate2(int depth) => Calculate(depth);
public void Run()
{
try
{
_stackStart = GetSomePointerNearTheTopOfTheStack();
_maxDepth = 10;
Calculate(0);
}
catch (Exception e)
{
Console.WriteLine("Finished with " + e.GetType());
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe UIntPtr GetSomePointerNearTheTopOfTheStack()
{
int dummy;
return new UIntPtr(&dummy);
}
private int GetLooseStackOffset() => (int)((ulong)_stackStart - (ulong)GetSomePointerNearTheTopOfTheStack());
public static void Main() => new Program().Run();
}
Here's the result:
In try at depth 0: stack frame count = 3, stack offset = 384
In try at depth 1: stack frame count = 5, stack offset = 752
In try at depth 2: stack frame count = 7, stack offset = 1120
In try at depth 3: stack frame count = 9, stack offset = 1488
In try at depth 4: stack frame count = 11, stack offset = 1856
In try at depth 5: stack frame count = 13, stack offset = 2224
In try at depth 6: stack frame count = 15, stack offset = 2592
In try at depth 7: stack frame count = 17, stack offset = 2960
In try at depth 8: stack frame count = 19, stack offset = 3328
In try at depth 9: stack frame count = 21, stack offset = 3696
In try at depth 10: stack frame count = 23, stack offset = 4064
In catch at depth 10: stack frame count = 23, stack offset = 13024
In catch at depth 9: stack frame count = 21, stack offset = 21888
In catch at depth 8: stack frame count = 19, stack offset = 30752
In catch at depth 7: stack frame count = 17, stack offset = 39616
In catch at depth 6: stack frame count = 15, stack offset = 48480
In catch at depth 5: stack frame count = 13, stack offset = 57344
In catch at depth 4: stack frame count = 11, stack offset = 66208
In catch at depth 3: stack frame count = 9, stack offset = 75072
In catch at depth 2: stack frame count = 7, stack offset = 83936
In catch at depth 1: stack frame count = 5, stack offset = 92800
In catch at depth 0: stack frame count = 3, stack offset = 101664
Finished with System.InvalidOperationException
Oops. Yes, the stack actually grows while we're handling the exceptions.
At _maxDepth = 1000
, the execution ends at:
In catch at depth 933: stack frame count = 1869, stack offset = 971232
In catch at depth 932: stack frame count = 1867, stack offset = 980096
In catch at depth 931: stack frame count = 1865, stack offset = 988960
In catch at depth 930: stack frame count = 1863, stack offset = 997824
Process is terminated due to StackOverflowException.
So about 997824 bytes of used stack space by our own code, which is pretty close to the 1 MB default thread stack size on Windows. The calling CLR code should make up for the difference.
As you may know, SEH exceptions are handled in two passes:
catch
clause matches the correct exception type, and executes the when
part of catch (...) when (...)
if one is present.Here's what the second article says on the unwinding process:
When an exception occurs, the system walks the list of
EXCEPTION_REGISTRATION
structures until it finds a handler for the exception. Once a handler is found, the system walks the list again, up to the node that will handle the exception. During this second traversal, the system calls each handler function a second time. The key distinction is that in the second call, the value 2 is set in the exception flags. This value corresponds toEH_UNWINDING
.[...]
After an exception is handled and all the previous exception frames have been called to unwind, execution continues wherever the handling callback decides.
This only confirms what the first article said.
The first pass needs to have the faulting stack preserved to be able to inspect its state, and to be able to resume execution on the faulting instruction (yes, that's a thing - it's very low-level but you can patch the cause of the error and resume execution as if there was no error in the first place).
The second pass is implemented just like the first one, except the handlers now get the EH_UNWINDING
flag. This means the faulting stack is still preserved at that point, until the final handler decides where to resume execution.
The stack pointer moves 36 bytes for a 32-Bit process, but whopping 8976 bytes for a 64-bit process here while unwinding the stack. What's the explanation for this?
Good question!
That's because 32-bit and 64-bit SEH are completely different. Here's some reading material (emphasis mine):
Because on the x86 each function that uses SEH has this aforementioned construct as part of its prolog, the x86 is said to use frame based exception handling. There are a couple of problems with this approach:
- Because the exception information is stored on the stack, it is susceptible to buffer overflow attacks.
- Overhead. Exceptions are, well, exceptional, which means the exception will not occur in the common case. Regardless, every time a function is entered that uses SEH, these extra instructions are executed.
Because the x64 was a chance to do away with a lot of the cruft that had been hanging around for decades, SEH got an overhaul that addressed both issues mentioned above. On the x64, SEH has become table-based, which means when the source code is compiled, a table is created that fully describes all the exception handling code within the module. This table is then stored as part of the PE header. If an exception occurs, the exception table is parsed by Windows to find the appropriate exception handler to execute. Because exception handling information is tucked safely away in the PE header, it is no longer susceptible to buffer overflow attacks. In addition, because the exception table is generated as part of the compilation process, no runtime overhead (in the form of push and pop instructions) is incurred during normal processing.
Of course, table-based exception handling schemes have a couple of negative aspects of their own. For example, table-based schemes tend to take more space in memory than stack-based schemes. Also, while overhead in the normal execution path is reduced, the overhead it takes to process an exception is significantly higher than in frame-based approaches. Like everything in life, there are trade-offs to consider when evaluating whether the table-based or a frame-based approach to exception handling is "best."
In short, the happy path has been optimized in x64, while the exceptional path has become slower. If you ask me, that's a very good thing.
Here's another citation from the first article I linked to earlier:
Both IA64 and AMD64 have a model for exception handling that avoids reliance on an explicit handler chain that starts in TLS and is threaded through the stack. Instead, exception handling relies on the fact that on 64-bit systems we can perfectly unwind a stack. And this ability is itself due to the fact that these chips are severely constrained on the calling conventions they support.
[...]
Anyway, on 64-bit systems the correspondence between an activation record on the stack and the exception record that applies to it is not achieved through an FS:[0] chain. Instead, unwinding of the stack reveals the code addresses that correspond to a particular activation record. These instruction pointers of the method are looked up in a table to find out whether there are any__try/__except/__finally clauses that cover these code addresses. This table also indicates how to proceed with the unwind by describing the actions of the method epilog.
So, yeah. Completely different approach.
But let's take a look at the x64 call stack to see where the stack space is actually used. I modified Calculate
as follows:
private int Calculate(int depth)
{
try
{
if (depth >= _maxDepth)
throw new InvalidOperationException("Max. recursion depth reached.");
return Calculate2(depth + 1);
}
catch
{
if (depth == _maxDepth)
{
Console.ReadLine();
}
throw;
}
}
I put a breakpoint on Console.ReadLine();
and took a look at the native call stack along with the value of the stack pointer register on each frame:
The addresses fffffffffffffffe
and 0000000000008000
look very odd to me, but anyway this shows how much stack space each frame consumes. A lot of stuff is going on in the Windows Native API (ntdll.dll), and the CLR adds some.
We're out of luck as far as the internal Windows stuff goes, as the source code is not publicly available. But we can at least take a look at clr.dll!ClrUnwindEx
, as that function is quite small but uses quite a lot of stack space:
void ClrUnwindEx(EXCEPTION_RECORD* pExceptionRecord, UINT_PTR ReturnValue, UINT_PTR TargetIP, UINT_PTR TargetFrameSp)
{
PVOID TargetFrame = (PVOID)TargetFrameSp;
CONTEXT ctx;
RtlUnwindEx(TargetFrame,
(PVOID)TargetIP,
pExceptionRecord,
(PVOID)ReturnValue, // ReturnValue
&ctx,
NULL); // HistoryTable
// doesn't return
UNREACHABLE();
}
It defines a CONTEXT
variable on the stack, which is a large struct. I can only hypothesize that the 64-bit SEH functions use their stack space for similar purposes.
Now let's compare this to the 32-bit call stack:
As you can see, it's not the same thing as in 64-bit at all.
Out of curiosity, I tested the behavior of a simple C++ program:
#include "stdafx.h"
#include <iostream>
__declspec(noinline) static char* GetSomePointerNearTheTopOfTheStack()
{
char dummy;
return &dummy;
}
int main()
{
auto start = GetSomePointerNearTheTopOfTheStack();
try
{
throw 42;
}
catch (...)
{
auto here = GetSomePointerNearTheTopOfTheStack();
std::cout << "Difference in " << (sizeof(char*) * 8) << "-bit: " << (start - here) << std::endl;
}
return 0;
}
Here are the results:
Difference in 32-bit: 2224
Difference in 64-bit: 9744
Which further shows that it's not only a CLR thing, but it's due to the underlying SEH implementation difference.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With