
When getting a substring in .NET, does the new string reference the same original string data or does the data get copied?

Assuming I have the following strings:

string str1 = "Hello World!";
string str2 = str1.Substring(6, 5); // "World"

I am hoping that in the above example str2 does not copy "World", but simply ends up being a new string that points to the same memory, starting at an offset of 6 with a length of 5.

In actuality I am dealing with some potentially very long strings and am interested in how this works behind the scenes for performance reasons. I am not familiar enough with IL to look into this.

Elan asked Mar 18 '10 22:03



1 Answer

As others have noted, the CLR makes copies when doing a substring operation.

As you note, it certainly would be possible for a string to be represented as an interior pointer with a length. This makes the substring operation extremely cheap.

There are also ways to make other operations cheap. For example, string concatenation can be made cheap by representing strings as a tree of substrings.
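To make the "tree of substrings" idea concrete, here is a minimal sketch. The `Rope` type and its members are hypothetical names invented for this illustration; nothing like it is built into `System.String`. Concatenation allocates one small node and copies no characters, and the actual copying is postponed until a flat string is requested.

```csharp
using System;
using System.Text;

// Hypothetical illustration type; not part of the .NET base class library.
public abstract class Rope
{
    public abstract int Length { get; }

    // Writes this node's characters, in order, into the builder.
    internal abstract void AppendTo(StringBuilder sb);

    public static Rope Leaf(string text) => new LeafRope(text);

    // Cheap: allocates a single node and copies no character data.
    public Rope Concat(Rope other) => new ConcatRope(this, other);

    // The expensive part is deferred to here: the tree is walked and
    // every character is copied into one flat string.
    public override string ToString()
    {
        var sb = new StringBuilder(Length);
        AppendTo(sb);
        return sb.ToString();
    }

    private sealed class LeafRope : Rope
    {
        private readonly string _text;
        public LeafRope(string text) { _text = text; }
        public override int Length => _text.Length;
        internal override void AppendTo(StringBuilder sb) => sb.Append(_text);
    }

    private sealed class ConcatRope : Rope
    {
        private readonly Rope _left;
        private readonly Rope _right;
        private readonly int _length;

        public ConcatRope(Rope left, Rope right)
        {
            _left = left;
            _right = right;
            _length = left.Length + right.Length;
        }

        public override int Length => _length;

        // Recursive walk; a very deep tree would need an explicit stack.
        internal override void AppendTo(StringBuilder sb)
        {
            _left.AppendTo(sb);
            _right.AppendTo(sb);
        }
    }
}
```

For example, `Rope.Leaf("Hello").Concat(Rope.Leaf(", ")).Concat(Rope.Leaf("World"))` builds two small nodes, and only the call to `ToString()` on the result copies characters. Every concatenation costs an extra object and extra pointers, and that bookkeeping is where the overhead comes from.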

In both cases, the result of the operation is not actually the "result" itself but a cheap object that represents the ability to get at the results when needed.

The attentive reader will have just realized that this is how LINQ works. When we say

var results = from c in customers where c.City == "London" select c.Name;

"results" does not contain the results of the query. This code returns almost immediately; results contains an object which represents the query. Only when the query is iterated does the expensive mechanism of searching the collection spin up. We use the power of a monadic representation of sequence semantics to defer the calculations until later.
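The deferral can be observed directly by counting how often the filter runs. The sketch below assumes LINQ to Objects over an in-memory list; `Customer`, `DeferredQueryDemo`, and `PredicateCalls` are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredQueryDemo
{
    // Hypothetical stand-in for the customer type used in the query above.
    public sealed class Customer
    {
        public Customer(string name, string city) { Name = name; City = city; }
        public string Name { get; }
        public string City { get; }
    }

    // Counts how many times the filter has actually been evaluated.
    public static int PredicateCalls;

    private static bool IsInLondon(Customer c)
    {
        PredicateCalls++;
        return c.City == "London";
    }

    // Builds the query object only; no customer is examined here.
    public static IEnumerable<string> BuildQuery(IEnumerable<Customer> customers)
    {
        return from c in customers where IsInLondon(c) select c.Name;
    }

    public static List<string> Run()
    {
        var customers = new List<Customer>
        {
            new Customer("Alice", "London"),
            new Customer("Bob", "Paris"),
        };

        PredicateCalls = 0;
        var results = BuildQuery(customers);
        // PredicateCalls is still 0 here: nothing has been filtered yet.

        var names = results.ToList();
        // Iterating the query ran the filter once per customer.
        return names;
    }
}
```

Iterating `results` a second time would run the filter again, since the query object stores the recipe and not the answer.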

The question then becomes "is it a good idea to do the same thing on strings?" and the answer is a resounding "no". I have plenty of painful real-world experience with this. I once spent a summer rewriting the VBScript compiler's string handling routines to store string concatenations as a tree of string concatenation operations; only when the result is actually being used as a string does the concatenation actually happen. It was disastrous; the additional time and memory needed to keep track of all the string pointers made the 99% case -- someone doing a few simple little string operations to render a web page -- about twice as slow, while massively speeding up the tiny, tiny minority of pages that were written using naive string concatenations.

The vast majority of realistic string operations in .NET programs are extremely fast; they compile down to memory moves that in normal circumstances stay well within the memory blocks that are cached by the processor, and are therefore blazingly fast.

Furthermore, using an "interior pointer" approach for strings complicates the garbage collector considerably; going with such an approach seems to make it likely that the GC would slow down overall, which benefits no one. You have to look at the total cost of the impact of the change, not just its impact on some narrow scenarios.

If you have specific performance needs due to your unusually large data then you should consider writing your own special-purpose string library that uses a "monadic" approach like LINQ does. You can represent your strings internally as arrays of char, and then substring operations simply become copying a reference to the array and changing the start and end positions.
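As a rough sketch of that suggestion, the type below wraps a `char[]` together with a start offset and a length, so taking a substring copies a reference and two integers and leaves the characters where they are. `CharSlice` is a hypothetical name for this illustration, and the design is only one of several reasonable ones.

```csharp
using System;

// Hypothetical special-purpose type; a sketch, not a drop-in string replacement.
public readonly struct CharSlice
{
    private readonly char[] _buffer; // shared with every slice taken from it
    private readonly int _start;

    public int Length { get; }

    // Wraps the whole array. The caller must not modify the array afterwards.
    public CharSlice(char[] buffer) : this(buffer, 0, buffer.Length) { }

    private CharSlice(char[] buffer, int start, int length)
    {
        _buffer = buffer;
        _start = start;
        Length = length;
    }

    public char this[int index]
    {
        get
        {
            if ((uint)index >= (uint)Length)
                throw new ArgumentOutOfRangeException(nameof(index));
            return _buffer[_start + index];
        }
    }

    // Cheap: copies the array reference and adjusts offset and length.
    public CharSlice Substring(int start, int length)
    {
        if (start < 0 || length < 0 || start > Length - length)
            throw new ArgumentOutOfRangeException();
        return new CharSlice(_buffer, _start + start, length);
    }

    // Copies characters only when a real string is finally needed.
    public override string ToString() =>
        _buffer == null ? string.Empty : new string(_buffer, _start, Length);
}
```

With `var text = new CharSlice("Hello World!".ToCharArray());`, the call `text.Substring(6, 5)` yields a slice whose `ToString()` is "World" without copying those five characters up front. The trade-off is that even a tiny slice keeps the entire backing array alive, which is one reason this is not a free win in general. Newer versions of .NET also ship `ReadOnlySpan<char>` and `ReadOnlyMemory<char>`, which offer a similar view over existing strings.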

Eric Lippert answered Oct 17 '22 09:10
