Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are the dangers when creating a thread with a stack size of 50x the default?

People also ask

What is the default size of the stack?

Stacks are temporary memory address spaces used to hold arguments and automatic variables during invocation of a subprogram or function reference. In general, the default main stack size is 8 megabytes.

What is the maximum size of stack?

In Microsoft Windows 2000, if the Microsoft ASP.NET Worker Process (ASPNet_wp.exe) creates a thread, the maximum stack size of the thread is 1 MB. In Windows Server 2008 and higher, the maximum stack size of a thread running on 32-bit version of IIS is 256 KB, and on an x64 server is 512 KB.

Is the stack size affected by the stack frame size?

There is no standard minimum size for a stack frame. The maximum size of a stack frame depends on the platform and implementation. A common implementation is to have the stack to expand towards the heap and the heap expands towards the stack.

Why is stack size limit?

Because all threads in a process share the same address space, they have to divide it between them. And after the operating system has taken its part, there is "only" 2-3 GB left for an application. And that size is the limit for both the physical and the virtual memory, because there just aren't any more addresses.


Upon comparing test code with Sam, I determined that we are both right!
However, about different things:

  • Accessing memory (reading and writing) is just as fast wherever it is - stack, global or heap.
  • Allocating it, however, is fastest on stack and slowest on heap.

It goes like this: stack < global < heap. (allocation time)
Technically, stack allocation isn't really an allocation, the runtime just makes sure a part of the stack (frame?) is reserved for the array.

I strongly advise being careful with this, though.
I recommend the following:

  1. When you need to create arrays frequently which never leave the function (e.g. by passing its reference), using the stack will be an enormous improvement.
  2. If you can recycle an array, do so whenever you can! The heap is the best place for long-term object storage. (polluting global memory isn't nice; stack frames can disappear)

(Note: 1. only applies to value types; reference types will be allocated on the heap and the benefit will be reduced to 0)

To answer the question itself: I have not encountered any problem at all with any large-stack test.
I believe the only possible problems are a stack overflow, if you are not careful with your function calls and running out of memory when creating your thread(s) if the system is running low.

The section below is my initial answer. It is wrong-ish and the tests aren't correct. It is kept only for reference.


My test indicates the stack-allocated memory and global memory is at least 15% slower than (takes 120% the time of) heap-allocated memory for usage in arrays!

This is my test code, and this is a sample output:

Stack-allocated array time: 00:00:00.2224429
Globally-allocated array time: 00:00:00.2206767
Heap-allocated array time: 00:00:00.1842670
------------------------------------------
Fastest: Heap.

  |    S    |    G    |    H    |
--+---------+---------+---------+
S |    -    | 100.80 %| 120.72 %|
--+---------+---------+---------+
G |  99.21 %|    -    | 119.76 %|
--+---------+---------+---------+
H |  82.84 %|  83.50 %|    -    |
--+---------+---------+---------+
Rates are calculated by dividing the row's value to the column's.

I tested on Windows 8.1 Pro (with Update 1), using an i7 4700 MQ, under .NET 4.5.1
I tested both with x86 and x64 and the results are identical.

Edit: I increased the stack size of all threads 201 MB, the sample size to 50 million and decreased iterations to 5.
The results are the same as above:

Stack-allocated array time: 00:00:00.4504903
Globally-allocated array time: 00:00:00.4020328
Heap-allocated array time: 00:00:00.3439016
------------------------------------------
Fastest: Heap.

  |    S    |    G    |    H    |
--+---------+---------+---------+
S |    -    | 112.05 %| 130.99 %|
--+---------+---------+---------+
G |  89.24 %|    -    | 116.90 %|
--+---------+---------+---------+
H |  76.34 %|  85.54 %|    -    |
--+---------+---------+---------+
Rates are calculated by dividing the row's value to the column's.

Though, it seems the stack is actually getting slower.


I've discovered a 530% increase in processing speed!

That's by far the biggest danger I'd say. There's something seriously wrong with your benchmark, code that behaves this unpredictably usually has a nasty bug hidden somewhere.

It is very, very difficult to consume a lot of stack space in a .NET program, other than by excessive recursion. The size of the stack frame of managed methods are set in stone. Simply the sum of the arguments of the method and the local variables in a method. Minus the ones that can be stored in a CPU register, you can ignore that since there are so few of them.

Increasing the stack size doesn't accomplish anything, you'll just reserve a bunch of address space that will never be used. There is no mechanism that can explain a perf increase from not using memory of course.

This is unlike a native program, particularly one written in C, it can also reserve space for arrays on the stack frame. The basic malware attack vector behind stack buffer overflows. Possible in C# as well, you'd have to use the stackalloc keyword. If you are doing that then the obvious danger is having to write unsafe code that is subject to such attacks, as well as random stack frame corruption. Very hard to diagnose bugs. There is a counter-measure against this in later jitters, I think starting at .NET 4.0, where the jitter generates code to put a "cookie" on the stack frame and checks if it is still intact when the method returns. Instant crash to the desktop without any way to intercept or report the mishap if that happens. That's ... dangerous to the user's mental state.

The main thread of your program, the one started by the operating system, will have a 1 MB stack by default, 4 MB when you compile your program targeting x64. Increasing that requires running Editbin.exe with the /STACK option in a post build event. You can typically ask for up to 500 MB before your program will have trouble getting started when running in 32-bit mode. Threads can too, much easier of course, the danger zone typically hovers around 90 MB for a 32-bit program. Triggered when your program has been running for a long time and address space got fragmented from previous allocations. Total address space usage must already be high, over a gig, to get this failure mode.

Triple-check your code, there's something very wrong. You can't get a x5 speedup with a bigger stack unless you explicitly write your code to take advantage of it. Which always requires unsafe code. Using pointers in C# always has a knack for creating faster code, it isn't subjected to the array bounds checks.


I would have a reservation there that I simply wouldn't know how to predict it - permissions, GC (which needs to scan the stack), etc - all could be impacted. I would be very tempted to use unmanaged memory instead:

var ptr = Marshal.AllocHGlobal(sizeBytes);
try
{
    float* x = (float*)ptr;
    DoWork(x);
}
finally
{
    Marshal.FreeHGlobal(ptr);
}

One thing that can go wrong is that you might not get the permission to do so. Unless running in full-trust mode, the Framework will just ignore the request for a larger stack size (see MSDN on Thread Constructor (ParameterizedThreadStart, Int32))

Instead of increasing the system stack size to such huge numbers, I would suggest to rewrite your code so that it uses Iteration and a manual stack implementation on the heap.


The high performant arrays might be accessible in the same way as a normal C# one but that could be the beginning of trouble: Consider the following code:

float[] someArray = new float[100]
someArray[200] = 10.0;

You expect an out of bound exception and this completely makes sense because you are trying to access element 200 but the maximum allowed value is 99. If you go to the stackalloc route then there will be no object wrapped around your array to bound check and the following will not show any exception:

Float* pFloat =  stackalloc float[100];
fFloat[200]= 10.0;

Above you are allocating enough memory to hold 100 floats and you are setting the sizeof(float) memory location which starts at the location started of this memory + 200*sizeof(float) for holding your float value 10. Unsurprisingly this memory is outside the allocated memory for the floats and nobody would know what could be stored in that address. If you are lucky you might have used some currently unused memory but at the same time it is likely you might overwrite some location that was used for storing other variables. To Summarize: Unpredictable runtime behaviour.


Microbenchmarking languages with JIT and GC such as Java or C# can be a bit complicated, so it's generally a good idea to use an existing framework - Java offers mhf or Caliper which are excellent, sadly to the best of my knowledge C# doesn't offer anything approaching those. Jon Skeet wrote this here which I'll blindly assume takes care of the most important things (Jon knows what he's doing in that area; also yes no worries I did actually check). I tweaked the timing a bit because 30 seconds per test after warmup was too much for my patience (5 seconds ought to do).

So first the results, .NET 4.5.1 under Windows 7 x64 -- the numbers denote the iterations it could run in 5 seconds so higher is better.

x64 JIT:

Standard       10,589.00  (1.00)
UnsafeStandard 10,612.00  (1.00)
Stackalloc     12,088.00  (1.14)
FixedStandard  10,715.00  (1.01)
GlobalAlloc    12,547.00  (1.18)

x86 JIT (yeah that's still kind of sad):

Standard       14,787.00   (1.02)
UnsafeStandard 14,549.00   (1.00)
Stackalloc     15,830.00   (1.09)
FixedStandard  14,824.00   (1.02)
GlobalAlloc    18,744.00   (1.29)

This gives a much more reasonable speedup of at most 14% (and most of the overhead is due to the GC having to run, consider it a worst case scenario realistically). The x86 results are interesting though - not entirely clear what's going on there.

and here's the code:

public static float Standard(int size) {
    float[] samples = new float[size];
    for (var ii = 0; ii < size; ii++) {
        samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
    }
    return samples[size - 1];
}

public static unsafe float UnsafeStandard(int size) {
    float[] samples = new float[size];
    for (var ii = 0; ii < size; ii++) {
        samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
    }
    return samples[size - 1];
}

public static unsafe float Stackalloc(int size) {
    float* samples = stackalloc float[size];
    for (var ii = 0; ii < size; ii++) {
        samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
    }
    return samples[size - 1];
}

public static unsafe float FixedStandard(int size) {
    float[] prev = new float[size];
    fixed (float* samples = &prev[0]) {
        for (var ii = 0; ii < size; ii++) {
            samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
        }
        return samples[size - 1];
    }
}

public static unsafe float GlobalAlloc(int size) {
    var ptr = Marshal.AllocHGlobal(size * sizeof(float));
    try {
        float* samples = (float*)ptr;
        for (var ii = 0; ii < size; ii++) {
            samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
        }
        return samples[size - 1];
    } finally {
        Marshal.FreeHGlobal(ptr);
    }
}

static void Main(string[] args) {
    int inputSize = 100000;
    var results = TestSuite.Create("Tests", inputSize, Standard(inputSize)).
        Add(Standard).
        Add(UnsafeStandard).
        Add(Stackalloc).
        Add(FixedStandard).
        Add(GlobalAlloc).
        RunTests();
    results.Display(ResultColumns.NameAndIterations);
}

Since the performance difference is too huge, the problem is barely related to allocation. It is likely caused by the array access.

I disassembled the loop body of the functions:

TestMethod1:

IL_0011:  ldloc.0 
IL_0012:  ldloc.1 
IL_0013:  ldc.i4.4 
IL_0014:  mul 
IL_0015:  add 
IL_0016:  ldc.r4 32768.
IL_001b:  stind.r4 // <----------- This one
IL_001c:  ldloc.1 
IL_001d:  ldc.i4.1 
IL_001e:  add 
IL_001f:  stloc.1 
IL_0020:  ldloc.1 
IL_0021:  ldc.i4 12500000
IL_0026:  blt IL_0011

TestMethod2:

IL_0012:  ldloc.0 
IL_0013:  ldloc.1 
IL_0014:  ldc.r4 32768.
IL_0019:  stelem.r4 // <----------- This one
IL_001a:  ldloc.1 
IL_001b:  ldc.i4.1 
IL_001c:  add 
IL_001d:  stloc.1 
IL_001e:  ldloc.1 
IL_001f:  ldc.i4 12500000
IL_0024:  blt IL_0012

We can check the usage of the instruction and more importantly, the exception they throw in ECMA spec:

stind.r4: Store value of type float32 into memory at address

Exceptions it throws:

System.NullReferenceException

And

stelem.r4: Replace array element at index with the float32 value on the stack.

Exception it throws:

System.NullReferenceException
System.IndexOutOfRangeException
System.ArrayTypeMismatchException

As you can see, stelem does more work in array range checking and type checking. Since the loop body does little thing (only assign value), the overhead of the checking dominates the computation time. So that’s why the performance differs by 530%.

And this also answers your questions: the danger is the absent of array range & type checking. This is unsafe (as mentioned in the function declaration ;D).