Why is there such a large speed difference between boxing and unboxing operations? I measure roughly a 10x difference. When should we care about this? Last week Azure support told us there is an issue with the heap memory of our application, and I am curious to know whether it could be related to boxing and unboxing.
using System;
using System.Diagnostics;
namespace ConsoleBoxing
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Program started");
var elapsed = Boxing();
Unboxing(elapsed);
Console.WriteLine("Program ended");
Console.Read();
}
private static void Unboxing(double boxingtime)
{
Stopwatch s = new Stopwatch();
s.Start();
for (int i = 0; i < 1000000; i++)
{
int a = 33; // data goes to the stack
object b = a; // boxing: the value is copied to the heap and b references it
int c = (int)b; // unboxing happens only here: the value is copied from the heap back to the stack
}
s.Stop();
var UnBoxing = s.Elapsed.TotalMilliseconds- boxingtime;
Console.WriteLine("UnBoxing time : " + UnBoxing);
}
private static double Boxing()
{
Stopwatch s = new Stopwatch();
s.Start();
for (int i = 0; i < 1000000; i++)
{
int a = 33;
object b = a;
}
s.Stop();
var elapsed = s.Elapsed.TotalMilliseconds;
Console.WriteLine("Boxing time : " + elapsed);
return elapsed;
}
}
}
People have already offered fantastic explanations for why unboxing is faster than boxing, so I want to say a little more about the methodology you used to test the performance difference.
Did you get your result (10x difference in speed) from the code you posted? If I run that program in release mode, here is the output:
Program started
Boxing time : 0.2741
UnBoxing time : 4.5847
Program ended
Whenever I do a micro performance benchmark, I verify that I am indeed comparing the operations I intended to compare, because the compiler can optimize your code. Open the executable in ILDASM:
Here is the IL for UnBoxing: (I only included the portion that matters most)
IL_0000: newobj instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005: stloc.0
IL_0006: ldloc.0
IL_0007: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c: ldc.i4.0
IL_000d: stloc.1
IL_000e: br.s IL_0025
IL_0010: ldc.i4.s 33
IL_0012: stloc.2
IL_0013: ldloc.2
IL_0014: box [mscorlib]System.Int32 //Here is the boxing
IL_0019: stloc.3
IL_001a: ldloc.3
IL_001b: unbox.any [mscorlib]System.Int32 //Here is the unboxing
IL_0020: pop
IL_0021: ldloc.1
IL_0022: ldc.i4.1
IL_0023: add
IL_0024: stloc.1
IL_0025: ldloc.1
IL_0026: ldc.i4 0xf4240
IL_002b: blt.s IL_0010
IL_002d: ldloc.0
IL_002e: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
And this is the code for Boxing:
IL_0000: newobj instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005: stloc.0
IL_0006: ldloc.0
IL_0007: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c: ldc.i4.0
IL_000d: stloc.1
IL_000e: br.s IL_0017
IL_0010: ldc.i4.s 33
IL_0012: stloc.2
IL_0013: ldloc.1
IL_0014: ldc.i4.1
IL_0015: add
IL_0016: stloc.1
IL_0017: ldloc.1
IL_0018: ldc.i4 0xf4240
IL_001d: blt.s IL_0010
IL_001f: ldloc.0
IL_0020: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
There is no boxing instruction at all in the Boxing method. It has been completely removed by the compiler, so the Boxing method does nothing but iterate over an empty loop. The time measured in UnBoxing therefore becomes the total time of boxing and unboxing.
Micro-benchmarking is very vulnerable to compiler tricks. I would suggest you have a look at your IL as well. It may be different if you are using a different compiler.
I modified your test code a little bit:
Boxing method:
private static object Boxing()
{
Stopwatch s = new Stopwatch();
int unboxed = 33;
object boxed = null;
s.Start();
for (int i = 0; i < 1000000; i++)
{
boxed = unboxed;
}
s.Stop();
var elapsed = s.Elapsed.TotalMilliseconds;
Console.WriteLine("Boxing time : " + elapsed);
return boxed;
}
And Unboxing method:
private static int Unboxing()
{
Stopwatch s = new Stopwatch();
object boxed = 33;
int unboxed = 0;
s.Start();
for (int i = 0; i < 1000000; i++)
{
unboxed = (int)boxed;
}
s.Stop();
var time = s.Elapsed.TotalMilliseconds;
Console.WriteLine("UnBoxing time : " + time);
return unboxed;
}
With these changes, the two methods are translated into similar IL:
For Boxing method:
IL_000c: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0011: ldc.i4.0
IL_0012: stloc.3
IL_0013: br.s IL_0020
IL_0015: ldloc.1
IL_0016: box [mscorlib]System.Int32 //Here is the boxing
IL_001b: stloc.2
IL_001c: ldloc.3
IL_001d: ldc.i4.1
IL_001e: add
IL_001f: stloc.3
IL_0020: ldloc.3
IL_0021: ldc.i4 0xf4240
IL_0026: blt.s IL_0015
IL_0028: ldloc.0
IL_0029: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
For UnBoxing:
IL_0011: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0016: ldc.i4.0
IL_0017: stloc.3
IL_0018: br.s IL_0025
IL_001a: ldloc.1
IL_001b: unbox.any [mscorlib]System.Int32 //Here is the unboxing
IL_0020: stloc.2
IL_0021: ldloc.3
IL_0022: ldc.i4.1
IL_0023: add
IL_0024: stloc.3
IL_0025: ldloc.3
IL_0026: ldc.i4 0xf4240
IL_002b: blt.s IL_001a
IL_002d: ldloc.0
IL_002e: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
Run several loops to remove the cold startup effect:
static void Main(string[] args)
{
Console.WriteLine("Program started");
for (int i = 0; i < 10; i++)
{
Boxing();
Unboxing();
}
Console.WriteLine("Program ended");
Console.Read();
}
Here is the output:
Program started
Boxing time : 3.4814
UnBoxing time : 0.1712
Boxing time : 2.6294
...
Boxing time : 2.4842
UnBoxing time : 0.1712
Program ended
Does that prove that unboxing is 10x faster than boxing? Let's check out the assembly code with WinDbg:
0:004> !u 000007fe93b83940
Normal JIT generated code
MicroBenchmarks.Program.Boxing()
...
000007fe`93ca01b3 call System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
...
//This is the for loop
000007fe`93ca01c2 mov eax,21h
000007fe`93ca01c7 mov dword ptr [rsp+20h],eax
000007fe`93ca01cb lea rdx,[rsp+20h]
000007fe`93ca01d0 lea rcx,[mscorlib_ni+0x6e92b0 (000007fe`f18b92b0)]
//here is the boxing
000007fe`93ca01d7 call clr!JIT_BoxFastMP_InlineGetThread (000007fe`f33126d0)
000007fe`93ca01dc mov rsi,rax
//loop unrolling: instead of incrementing i by 1, we are actually incrementing i by 4
000007fe`93ca01df add edi,4
000007fe`93ca01e2 cmp edi,0F4240h // 0F4240h = 1000000
000007fe`93ca01e8 jl 000007fe`93ca01c2 // jumps to the line "mov eax,21h"
//end of the for loop
000007fe`93ca01ea mov rcx,rbx
000007fe`93ca01ed call System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)
The assembly for UnBoxing:
0:004> !u 000007fe93b83930
Normal JIT generated code
MicroBenchmarks.Program.Unboxing()
Begin 000007fe93ca02c0, size 117
000007fe`93ca02c0 push rbx
...
000007fe`93ca030a call System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
000007fe`93ca030f mov qword ptr [rbx+10h],rax
000007fe`93ca0313 mov byte ptr [rbx+18h],1
000007fe`93ca0317 xor eax,eax
000007fe`93ca0319 mov edi,dword ptr [rdi+8]
000007fe`93ca031c nop dword ptr [rax]
//This is the for loop
//again, loop unrolling
000007fe`93ca0320 add eax,4
000007fe`93ca0323 cmp eax,0F4240h // 0F4240h = 1000000
000007fe`93ca0328 jl 000007fe`93ca0320 //jumps to "add eax,4"
//end of the for loop
000007fe`93ca032a mov rcx,rbx
000007fe`93ca032d call System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)
You can see that even if the comparison looks reasonable at the IL level, the JIT can still perform another optimization at run time. The UnBoxing method is running an empty loop again. Until you verify that the code executed for the two methods is comparable, it is very hard to simply conclude that "unboxing is 10x faster than boxing".
Think of unboxing as a single memory load instruction from the boxed object to a register. Maybe with a bit of surrounding address calculation and cast validation logic. A boxed object is like a class with one field of the boxed type. How expensive can those operations be? Not very, especially since the L1 cache hit rate in your benchmark is ~100%.
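The "class with one field" picture can be sketched in code. This is only a mental model: `BoxedInt`, `Box` and `Unbox` are hypothetical names made up for illustration, and the real box is an internal runtime object rather than a user-visible class.

```csharp
using System;

// Hypothetical stand-in for what the runtime creates when it boxes an int.
sealed class BoxedInt
{
    public readonly int Value;
    public BoxedInt(int value) { Value = value; }
}

static class BoxModel
{
    // "Boxing": allocate a heap object and copy the value into it.
    public static object Box(int value)
    {
        return new BoxedInt(value);
    }

    // "Unboxing": validate the runtime type, then read a single field.
    public static int Unbox(object boxed)
    {
        return ((BoxedInt)boxed).Value;
    }

    public static void Main()
    {
        object b = Box(33);
        Console.WriteLine(Unbox(b)); // prints 33
    }
}
```

In this model `Unbox` is a type check plus one field read, while `Box` has to go through the allocator, which is where the cost difference comes from.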
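Boxing involves allocating a new object on the heap and garbage-collecting it later. In your code nearly every boxed object becomes garbage as soon as it is created, so the cost is dominated by the allocation itself plus the gen 0 collections that those allocations eventually trigger.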
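To see how much collection work a boxing loop actually causes on your machine, you can count gen 0 collections around it. The following is a minimal sketch; the class name, the `_sink` field and the iteration count are assumptions chosen for illustration, and the number you get depends on the runtime and GC configuration.

```csharp
using System;

static class BoxingAllocations
{
    // Storing each box in a static field makes the object escape,
    // so the allocation cannot simply be dropped as dead code.
    private static object _sink;

    public static int Gen0CollectionsWhileBoxing(int iterations)
    {
        int before = GC.CollectionCount(0);
        for (int i = 0; i < iterations; i++)
        {
            _sink = i; // boxes i: one heap allocation per iteration
        }
        return GC.CollectionCount(0) - before;
    }

    public static void Main()
    {
        int collections = Gen0CollectionsWhileBoxing(10000000);
        Console.WriteLine("Gen 0 collections during boxing loop: " + collections);
    }
}
```

Comparing this count with the number of iterations shows how often an allocation really ends in a collection, which is more reliable than guessing.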
That said, your benchmark is invalid because the loops have no side effects. It is probably luck that the current JIT cannot optimize them away. Have the loop compute a result and funnel it into GC.KeepAlive to make the result appear used. Also, you might be running in Debug mode.
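A minimal sketch of that suggestion follows. The class and method names are made up for illustration, and this is not a guaranteed fix: the JIT may still hoist or simplify parts of either loop, so the generated code should be checked before the numbers are trusted.

```csharp
using System;
using System.Diagnostics;

static class KeepAliveBenchmark
{
    const int Iterations = 1000000;

    public static double TimeBoxing()
    {
        object boxed = null;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            boxed = i; // boxes a different value on every iteration
        }
        sw.Stop();
        GC.KeepAlive(boxed); // makes the last box appear used
        return sw.Elapsed.TotalMilliseconds;
    }

    public static double TimeUnboxing()
    {
        object boxed = 33;
        long sum = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            sum += (int)boxed; // the unboxed value feeds into a result
        }
        sw.Stop();
        GC.KeepAlive(sum); // sum is boxed here, outside the timed region
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        // Repeat several rounds so that JIT warm-up does not dominate.
        for (int round = 0; round < 10; round++)
        {
            Console.WriteLine("Boxing time : " + TimeBoxing());
            Console.WriteLine("UnBoxing time : " + TimeUnboxing());
        }
    }
}
```

Build in Release mode and run without a debugger attached; for anything beyond a quick check, a dedicated benchmarking harness is a safer choice than hand-rolled Stopwatch loops.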