 

Why does .NET IL always create new string objects even when the higher level code references existing ones?

Tags: .net

Background: We have an XML document containing thousands of pseudocode functions. I've written a utility to parse this document and generate C# code from it. Here's a greatly simplified snippet of the code that gets generated:

public class SomeClass
{
    public string Func1() { return "Some Value"; }
    public string Func2() { return "Some Other Value"; }
    public string Func3() { return "Some Value"; }
    public string Func4() { return "Some Other Value"; }
    // ...
}

The important takeaway is that each string value may be returned by multiple methods. I assumed that a minor change, having the methods return references to static member strings instead, would both cut down on the assembly size and reduce the memory footprint of the program. For example:

public class SomeClass
{
    private const string _SOME_VALUE = "Some Value";
    private const string _SOME_OTHER_VALUE = "Some Other Value";
    // ...

    public string Func1() { return _SOME_VALUE; }
    public string Func2() { return _SOME_OTHER_VALUE; }
    public string Func3() { return _SOME_VALUE; }
    public string Func4() { return _SOME_OTHER_VALUE; }
    // ...
}

But to my surprise, inspection using the .NET ildasm.exe utility shows that in both cases the IL for the functions is identical. Here it is for one of them. Either way, a hard-coded value gets used with ldstr:

.method public hidebysig instance string
        Func1() cil managed
{
  // Code size       6 (0x6)
  .maxstack  8
  IL_0000:  ldstr      "Some Value"
  IL_0005:  ret
} // end of method SomeClass::Func1

In fact, the "optimized" version is slightly worse because it includes the static string members in the assembly. When I repeat this experiment using some other object type besides string, I see the difference that I expect. Note that the assemblies are generated with optimization enabled.

Question: Why does .NET apparently always create a new string object regardless of whether the code references an existing one?

Kevin D. asked Jul 13 '11 01:07



2 Answers

  IL_0000:  ldstr      "Some Value"
  IL_0005:  ret

The disassembler is being too helpful to show you what is really going on. You can tell from the IL addresses: the ldstr instruction takes only 5 bytes (a 1-byte opcode plus a 4-byte metadata token), far too few to store that string. Use View + Show token values to see what it really looks like. You'll then also see that identical strings use the same token value. This is called 'interning'.
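The 5-byte encoding can also be checked from code rather than from ildasm. The following is only a minimal sketch: the TokenDemo class and its FirstLdstrToken helper are hypothetical names made up for illustration, and the byte scan is naive (it looks for the first 0x72 byte, the ldstr opcode), so it is only adequate for tiny methods like the ones in the question. It reads the token that follows ldstr and asks the module to resolve it back to text.

```csharp
using System;
using System.Reflection;

public class SomeClass
{
    public string Func1() { return "Some Value"; }
    public string Func2() { return "Some Other Value"; }
    public string Func3() { return "Some Value"; }
}

// Hypothetical helper class, not part of the original question.
public static class TokenDemo
{
    private const byte LdstrOpcode = 0x72; // one-byte opcode for ldstr

    // Returns the 4-byte metadata token that follows the first ldstr opcode.
    // Naive scan: assumes the first 0x72 byte in the body really is ldstr.
    public static int FirstLdstrToken(MethodInfo method)
    {
        byte[] il = method.GetMethodBody().GetILAsByteArray();
        int i = Array.IndexOf(il, LdstrOpcode);
        if (i < 0 || i + 4 >= il.Length)
            throw new InvalidOperationException("no ldstr found");

        // The token is stored little-endian right after the opcode.
        return il[i + 1] | (il[i + 2] << 8) | (il[i + 3] << 16) | (il[i + 4] << 24);
    }

    public static void Main()
    {
        foreach (string name in new[] { "Func1", "Func2", "Func3" })
        {
            MethodInfo m = typeof(SomeClass).GetMethod(name);
            int token = FirstLdstrToken(m);
            // ResolveString maps a user-string token back to its characters.
            string text = m.Module.ResolveString(token);
            Console.WriteLine("{0}: token 0x{1:X8} -> \"{2}\"", name, token, text);
        }
    }
}
```

If the compiler shares one entry for identical literals, as described above, Func1 and Func3 should report the same token while Func2 reports a different one.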

The token value still doesn't show you where the string is really stored after the program is jitted. String literals go into the 'loader heap', a heap distinct from the garbage collected heap. It is the heap where static items are stored. Or to put it another way: string literals are highly optimized and very cheap. You cannot do better yourself.
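As a side note on why both versions in the question compile to the same IL: a C# const is substituted into each use site at compile time, so the method body still contains ldstr, whereas a static readonly field is read at run time with ldsfld. A minimal sketch of the difference, assuming a hypothetical ConstDemo class (none of these names come from the question):

```csharp
using System;
using System.Reflection;

// Hypothetical class for illustration only.
public class ConstDemo
{
    private const string ConstValue = "Some Value";              // compile-time literal
    private static readonly string ReadonlyValue = "Some Value"; // real static field

    public string FromConst() { return ConstValue; }       // body uses ldstr
    public string FromReadonly() { return ReadonlyValue; } // body uses ldsfld

    public static void Main()
    {
        BindingFlags flags = BindingFlags.NonPublic | BindingFlags.Static;
        FieldInfo constField = typeof(ConstDemo).GetField("ConstValue", flags);
        FieldInfo readonlyField = typeof(ConstDemo).GetField("ReadonlyValue", flags);

        // IsLiteral is true only for fields whose value is fixed at compile time.
        Console.WriteLine("const IsLiteral:     " + constField.IsLiteral);
        Console.WriteLine("readonly IsLiteral:  " + readonlyField.IsLiteral);
        Console.WriteLine("readonly IsInitOnly: " + readonlyField.IsInitOnly);

        // Both paths are expected to hand back the same string object, since
        // the readonly field was itself initialized from the same literal.
        ConstDemo demo = new ConstDemo();
        Console.WriteLine("same object: " +
            object.ReferenceEquals(demo.FromConst(), demo.FromReadonly()));
    }
}
```

Either way the literal ends up as a single shared string object, which is why hoisting it into a member yourself does not save anything.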

Hans Passant answered Dec 08 '22 22:12



See http://msdn.microsoft.com/en-us/library/system.reflection.emit.opcodes.ldstr(v=vs.71).aspx

The Common Language Infrastructure (CLI) guarantees that the result of two ldstr instructions referring to two metadata tokens that have the same sequence of characters return precisely the same string object (a process known as "string interning").
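The guarantee quoted above can be observed with Object.ReferenceEquals. This is a rough sketch rather than a definitive test: InternDemo is a hypothetical name, and the cut-down SomeClass just repeats two methods from the question. It compares two methods that return the same literal, then contrasts them with a string built at run time, which is a separate object until it is passed through String.Intern.

```csharp
using System;

public class SomeClass
{
    public string Func1() { return "Some Value"; }
    public string Func3() { return "Some Value"; }
}

// Hypothetical driver class for illustration only.
public static class InternDemo
{
    public static void Main()
    {
        SomeClass c = new SomeClass();

        // Two ldstr instructions with the same characters: same object expected.
        Console.WriteLine(object.ReferenceEquals(c.Func1(), c.Func3()));

        // A string assembled at run time has equal contents but is a new object.
        string built = new string("Some Value".ToCharArray());
        Console.WriteLine(built == c.Func1());                      // value equality
        Console.WriteLine(object.ReferenceEquals(c.Func1(), built)); // reference equality

        // String.Intern returns the pooled instance with the same contents.
        Console.WriteLine(object.ReferenceEquals(c.Func1(), string.Intern(built)));
    }
}
```

So the methods in the question do not each create a new string: every call returns a reference to the one interned instance.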

Ian Mercer answered Dec 08 '22 21:12
