Why are strings copied in .NET?

Tags: string, .net

Since strings are immutable in .NET, why are they copied for simple operations such as Substring or Split? For example, by keeping a char[] value, an int start and an int length, a substring could simply point into an existing string's data, and we could save the overhead of copying characters for many simple operations. So I wonder, why was the decision made to copy strings for such operations?

For example, was this done to support the current implementation of StringBuilder? Or to avoid keeping a reference to a large char[] when only a few characters are required? Or for any other reason you can think of? Can you suggest pros and cons for such a design?
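To make the proposed design concrete, here is a minimal sketch in Java of a string that shares its backing array between substrings. `SharedString`, `compact` and `retainedChars` are hypothetical names invented for this illustration; this is not the real `java.lang.String` or `System.String`, although it is roughly the layout that older Java `String` implementations used (as far as I know, JDKs from 7u6 onwards copy in `substring` too).

```java
import java.util.Arrays;

// Hypothetical illustration only: a string "view" that shares one char[]
// between a string and its substrings instead of copying characters.
final class SharedString {
    private final char[] value; // backing array, possibly shared with other views
    private final int start;    // index of this view's first character in value
    private final int length;   // number of characters in this view

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int start, int length) {
        this.value = value;
        this.start = start;
        this.length = length;
    }

    // O(1): no characters are copied; the new view aliases the same array.
    SharedString substring(int begin, int end) {
        if (begin < 0 || end > length || begin > end) {
            throw new IndexOutOfBoundsException("begin=" + begin + ", end=" + end);
        }
        return new SharedString(value, start + begin, end - begin);
    }

    // O(n): copies only this view's characters, releasing the shared array.
    SharedString compact() {
        return new SharedString(Arrays.copyOfRange(value, start, start + length), 0, length);
    }

    int length() {
        return length;
    }

    // Number of chars this view keeps reachable, however short the view is.
    int retainedChars() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, start, length);
    }
}

class SharedStringDemo {
    public static void main(String[] args) {
        SharedString big = new SharedString(new String(new char[1_000_000]).replace('\0', 'x'));
        SharedString one = big.substring(0, 1); // cheap to create...
        System.out.println(one.length() + " char view retains " + one.retainedChars() + " chars");
        SharedString copy = one.compact();      // ...but only a copy frees the big array
        System.out.println(copy.length() + " char copy retains " + copy.retainedChars() + " chars");
    }
}
```

The sketch shows both sides of the trade-off raised in the question: `substring` is constant-time, but a one-character view keeps the whole million-character array alive until someone explicitly copies it.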

As mentioned by @cletus and supported by @Jon Skeet, this is more like asking why .NET strings were built differently from Java in this aspect.

Hosam Aly asked Dec 06 '22 06:12



1 Answer

That's basically the way that Java works. There are a few benefits of the .NET way, IMO:

  • Locality of reference - the data and the length are in the same place
  • Fewer dereferences - the data is at a fixed point within the string object itself; no need to dereference another char array
  • Lack of aliasing when you've got a single character substring of an originally-large string, as mentioned by Renaud.
  • You end up with fewer objects and variables. In the case of a .NET string (assuming no wasted buffer space), the total size (on x86) is approximately 20 + 2*n bytes. In Java you've got the size of the array (12 + 2*n bytes) plus the string object itself (24 bytes: object overhead, reference, start and count; it also caches the hash code once it has been calculated). So for an empty string, the .NET version takes about 20 bytes compared with Java's 36. Of course that's the worst case in relative terms, since the gap is only ever that constant 16 bytes, but if you use a lot of independent strings it could end up being significant. It also gives the garbage collector more objects to look at.
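The size arithmetic in the last bullet can be sketched as follows. The constants are the approximate 32-bit x86 figures quoted above, not measurements, and `StringFootprint` and its method names are hypothetical, made up for this example.

```java
// Rough per-string footprint using the approximate x86 figures quoted above.
// These are estimates from the answer, not values measured on a real runtime.
final class StringFootprint {
    // .NET: characters live inside the string object itself.
    static long dotNetBytes(int n) {
        return 20L + 2L * n;
    }

    // Java (shared-array layout): a char[] object plus a separate String object.
    static long javaSharedBytes(int n) {
        long arrayBytes = 12L + 2L * n; // array header plus 2 bytes per char
        long stringBytes = 24L;         // the String object that points at the array
        return arrayBytes + stringBytes;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 10, 1000}) {
            System.out.println("n=" + n
                    + " .NET~" + dotNetBytes(n)
                    + " Java~" + javaSharedBytes(n)
                    + " diff=" + (javaSharedBytes(n) - dotNetBytes(n)));
        }
    }
}
```

Under these assumptions the difference stays at 16 bytes whatever n is, so it only matters proportionally for short strings, or in absolute terms when there are very many of them.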

Of course, the benefits are in terms of requiring less space when the aliasing above doesn't occur.

In the end it will depend on your usage - the compiler and runtime can't predict which usage pattern is more likely in your exact code.

There may also be interop benefits of the current string representation, but I don't know enough about that to say for sure.

EDIT: I'm not sure why your question has received so many somewhat-hostile answers. It's certainly not a "dumb" way of representing a string, and it clearly works. Fears about data loss and complexity are pretty much just FUD in this case, I believe - the Java string implementation is simple and robust. I personally suspect that the .NET way of doing things is more efficient in most programs, and I suspect MS did research to check that, but there will certainly be situations where the "shared" model works better.

Jon Skeet answered Jan 03 '23 00:01
