Unsafe string creation from char[]

I'm working on high-performance code in which this construct is part of the performance-critical section.

This is what happens in that section:

  1. A string is 'scanned' and metadata is stored efficiently.
  2. Based upon this metadata chunks of the main string are separated into a char[][].
  3. That char[][] should be transferred into a string[].

Now, I know you can just call new string(char[]), but that copies the contents of the array into the new string.

To avoid this extra copy, I assume it must be possible to write directly to the string's internal buffer, even though this would be an unsafe operation (and I know this brings lots of implications, such as buffer overflows and forward-compatibility problems).

I've seen several ways of achieving this, but none that I'm really satisfied with.

Does anyone have concrete suggestions on how to achieve this?

Extra information:
The actual process doesn't necessarily include converting to char[]; it's practically a 'multi-substring' operation, for example three index/length pairs whose substrings are appended to one another.

StringBuilder has too much overhead for such a small number of concatenations.
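
A minimal sketch of such a 'multi-substring' operation, assuming a runtime newer than the one this question targets (.NET Core 2.1 or later, where string.Create is available). string.Create allocates the result once and hands a callback a writable span over the new string's buffer, so no intermediate char[] is needed. SegmentJoiner, Join and the (Start, Length) tuple layout are hypothetical names used only for illustration.

```csharp
using System;

public static class SegmentJoiner
{
    // Builds one string from several (start, length) segments of source.
    public static string Join(string source, (int Start, int Length)[] segments)
    {
        // The final length must be known before the string is allocated.
        int total = 0;
        foreach (var seg in segments)
            total += seg.Length;

        if (total == 0)
            return string.Empty;

        // The callback writes straight into the new string's own buffer.
        return string.Create(total, (source, segments), (dest, state) =>
        {
            int pos = 0;
            foreach (var (start, length) in state.segments)
            {
                // AsSpan validates the range; CopyTo copies it into place.
                state.source.AsSpan(start, length).CopyTo(dest.Slice(pos));
                pos += length;
            }
        });
    }
}
```

For example, SegmentJoiner.Join("hello world", new[] { (0, 5), (6, 5) }) returns "helloworld". The return type is still string, so a string[] result can be filled one element at a time.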

EDIT:
Since it was somewhat unclear what exactly I'm asking, let me rephrase it.

This is what happens:

  1. Main string is indexed.
  2. Parts of the main string are copied to a char[].
  3. The char[] is converted to a string.

What I'd like to do is merge step 2 and 3, resulting in:

  1. Main string is indexed.
  2. Parts of the main string are copied to a string (and the GC can keep its hands off of it during the process by proper use of the fixed keyword?).

Note that I cannot change the output type from string[]: this is an external library, and other projects depend on it (backward compatibility).
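
The merged step 2 and 3 described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it must be compiled with unsafe code enabled, it relies on writing into a freshly allocated string that nothing else references yet (which the runtime does not officially support), and UnsafeSegmentJoiner, Join, starts and lengths are hypothetical names.

```csharp
using System;

public static class UnsafeSegmentJoiner
{
    // Copies the given ranges of source directly into a new string.
    public static unsafe string Join(string source, int[] starts, int[] lengths)
    {
        if (starts.Length != lengths.Length)
            throw new ArgumentException("starts and lengths must pair up.");

        int total = 0;
        for (int i = 0; i < lengths.Length; i++)
        {
            // Validate up front: a bad range in the copy loop would corrupt memory.
            if (starts[i] < 0 || lengths[i] < 0 || starts[i] > source.Length - lengths[i])
                throw new ArgumentOutOfRangeException(nameof(starts));
            total += lengths[i];
        }

        if (total == 0)
            return string.Empty; // never write into the shared empty string

        // Freshly allocated, so no other code can observe the mutation.
        string result = new string('\0', total);

        // fixed pins both strings so the GC cannot move them during the copy.
        fixed (char* src = source)
        fixed (char* dst = result)
        {
            char* write = dst;
            for (int i = 0; i < starts.Length; i++)
            {
                char* read = src + starts[i];
                for (int j = 0; j < lengths[i]; j++)
                    *write++ = read[j];
            }
        }
        return result;
    }
}
```

The pinning only lasts for the duration of the fixed block; once the block ends, result is an ordinary string and can be stored in the string[] output.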

Aidiakapi asked Jan 11 '12 21:01


2 Answers

I think that what you are asking to do is to 'carve up' an existing string in-place into multiple smaller strings without re-allocating character arrays for the smaller strings. This won't work in the managed world.

For one reason why, consider what happens when the garbage collector comes by and collects or moves the original string during a compaction: all of those other strings 'inside' of it would now be pointing at some arbitrary other memory, not at the original string you carved them out of.

EDIT: In contrast to the character-poking involved in Ben's answer (which is clever but IMHO a bit scary), you can allocate a StringBuilder with a pre-defined capacity, which eliminates the need to re-allocate the internal arrays. See http://msdn.microsoft.com/en-us/library/h1h0a5sy.aspx.
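
A minimal sketch of the pre-sized StringBuilder approach, assuming the segments are given as parallel start/length arrays (BuilderSegmentJoiner, Join, starts and lengths are hypothetical names). Whether ToString() makes one further copy depends on the runtime version, so this may not remove every copy.

```csharp
using System.Text;

public static class BuilderSegmentJoiner
{
    // Appends the given ranges of source into a builder sized for the result.
    public static string Join(string source, int[] starts, int[] lengths)
    {
        int total = 0;
        foreach (int length in lengths)
            total += length;

        // Capacity equals the final length, so Append never has to grow the buffer.
        var builder = new StringBuilder(total);
        for (int i = 0; i < starts.Length; i++)
        {
            // Append(string, int, int) copies the range without creating a substring.
            builder.Append(source, starts[i], lengths[i]);
        }

        return builder.ToString();
    }
}
```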

Chris Shain answered Oct 11 '22 02:10


What happens if you do:

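// GetBuffer() is a placeholder for any call that returns a string of at least 6 characters.
string s = GetBuffer();
// Requires an unsafe context; fixed pins s so the GC cannot move it while pch is in use.
fixed (char* pch = s) {
    // Overwrite the string's characters in place.
    pch[0] = 'R';
    pch[1] = 'e';
    pch[2] = 's';
    pch[3] = 'u';
    pch[4] = 'l';
    pch[5] = 't';
}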

I think the world will come to an end (or at least the .NET managed portion of it), but that's very close to what StringBuilder does.

Do you have profiler data to show that StringBuilder isn't fast enough for your purposes, or is that an assumption?
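
As a rough way to gather such data, each candidate could be timed with Stopwatch. This is only a sketch: Timing and Measure are hypothetical names, and a real profiler or a benchmarking library such as BenchmarkDotNet would give more trustworthy numbers.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Returns the elapsed time for calling build() the given number of times.
    public static TimeSpan Measure(Func<string> build, int iterations)
    {
        build(); // warm-up call so JIT compilation is not counted

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            build();
        watch.Stop();

        return watch.Elapsed;
    }
}
```

Calling Timing.Measure once per candidate with the same input and iteration count gives figures that can be compared with each other, though not across machines.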

Ben Voigt answered Oct 11 '22 00:10