Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to concatenate a list of strings using only a single allocation?

After doing some profiling, we've discovered that the current way in which our app concatenates strings causes an enormous amount of memory churn and CPU time.

We're building a List<string> of strings to concatenate that is on the order of 500 thousand elements long, referencing several hundred megabytes worth of strings. We're trying to optimize this one small part of our app since it seems to account for a disproportionate amount of CPU and memory usage.

We do a lot of text processing :)

Theoretically, we should be able to perform the concatenation in a single allocation and N copies - we can know how many total characters are available in our string, so it should just be as simple as summing up the lengths of the component strings and allocating enough underlying memory to hold the result.

Assuming we're starting with a pre-filled List<string>, is it possible to concatenate all strings in that list using a single allocation?

Currently, we're using the StringBuilder class, but this stores its own intermediate buffer of all of the characters - so we have an ever growing chunk array, with each chunk storing a copy of the characters we're giving it. Far from ideal. The allocations for the array of chunks aren't horrible, but the worst part is that it allocates intermediate character arrays, which means N allocations and copies.

The best we can do right now is to call List<string>.ToArray() - which performs one copy of a 500k element array - and pass the resulting string[] to string.Concat(params string[]). string.Concat() then performs two allocations, one to copy the input array into an internal array, and the one to allocate the destination string's memory.

From referencesource.microsoft.com:

    public static String Concat(params String[] values) {
        if (values == null)
            throw new ArgumentNullException("values");
        Contract.Ensures(Contract.Result<String>() != null);
        // Spec#: Consider a postcondition saying the length of this string == the sum of each string in array
        Contract.EndContractBlock();
        int totalLength=0;

        // -----------> Allocation #1 <---------
        String[] internalValues = new String[values.Length];
        
        for (int i=0; i<values.Length; i++) {
            string value = values[i];
            internalValues[i] = ((value==null)?(String.Empty):(value));
            totalLength += internalValues[i].Length;
            // check for overflow
            if (totalLength < 0) {
                throw new OutOfMemoryException();
            }
        }
        
        return ConcatArray(internalValues, totalLength);
    }

    private static String ConcatArray(String[] values, int totalLength) {

        // -----------------> Allocation #2 <---------------------
        String result =  FastAllocateString(totalLength);
        int currPos=0;

        for (int i=0; i<values.Length; i++) {
            Contract.Assert((currPos <= totalLength - values[i].Length), 
                            "[String.ConcatArray](currPos <= totalLength - values[i].Length)");

            FillStringChecked(result, currPos, values[i]);
            currPos+=values[i].Length;
        }

        return result;
    }

Thus, in the best case, we have three allocations, two for arrays referencing the component strings, and one for the destination concatenated string.

Can we improve on this? Is it possible to concatenate a List<string> using a single allocation and a single loop of character copies?

Edit 1


I'd like to summarize the various approaches discussed so far, and why they are still sub-optimal. I'd also like to set the parameters of the situation in concrete a little more, since I've received a lot of questions that try to side step the central question.

...

First, the structure of the code that I am working within. There are three layers:

  • Layer one is a set of methods that produce my content. These methods return small-ish string objects, which I will call my 'component' strings'. These string objects will eventually be concatenated into a single string. I do not have the ability to modify these methods; I have to face the reality that they return string objects and move forward.
  • Layer two is my code that calls these content producers and assembles the output, and is the subject of this question. I must call the content producer methods, collect the strings they return, and eventually concatenate the returned strings into a single string (reality is a little more complex; the returned strings are partitioned depending on how they're routed for output, and so I have several sets of large collections of strings).
  • Layer three is a set of methods that accept a single large string for further processing. Changing the interface of that code is beyond my control.

Talking about some numbers: a typical batch run will collect ~500000 strings from the content producers, representing about 200-500 MB of memory. I need the most efficient way to concatenate these 500k strings into a single string.

...

Now I'd like to examine the approaches discussed so far. For the sake of numbers, assume we're running 64-bit, assume that we are collecting 500000 string objects, and assume that the aggregate size of the string objects totals 200 megabytes worth of character data. Also, assume that the original string object's memory is not counted toward any approach's total in the below analysis. I make this assumption because it is necessarily common to any and all approaches, because it is an assumption that we cannot change the interface of the content producers - they return 500k relatively small fully formed strings objects that I must then accept and somehow concatenate. As stated above, I cannot change this interface.

Approach #1

Content producers ----> StringBuilder ----> string

Conceptually, this would be invoking the content producers, and directly writing the strings they return to a StringBuilder, and then later calling StringBuilder.ToString() to obtain the concatenated string.

By analyzing StringBuilder's implementation, we can see that the cost of this boils down to 400 MB of allocations and copies:

  • During the stage where we collect the output from the content producers, we're writing 200 MB of data to the StringBuilder. We would be performing one 200 MB allocation to pre-allocate the StringBuilder, and then 200 MB worth of copies as we copy and discard the strings returned from the content producers
  • After we've collected all output from the content producers and have a fully formed StringBuilder, we then need to call StringBuilder.ToString(). This performs exactly one allocation (string.FastAllocateString()), and then copies the string data from its internal buffers to the string object's internal memory.

Total cost: approximately 400 MB of allocations and copies

Approach #2

Content producers ---> pre-allocated char[] ---> string

This strategy is fairly simple. Assuming we know roughly how much character data we're going to be collecting from the producers, we can pre-allocate a char[] that is 200 MB large. Then, as we call the content producers, we copy the strings they return into our char[]. This accounts for 200 MB of allocations and copies. The final step to turn this into a string object is to pass it to the new string(char[]) constructor. However, since strings are immutable and arrays are not, the constructor will make a copy of that entire array, causing it to allocate and copy another 200 MB of character data.

Total cost: approximately 400 MB of allocations and copies

Approach #3:

Content producers ---> List<string> ----> string[] ----> string.Concat(string[])

  • Pre-allocate a List<string> to be about 500k elements - approximately 4 MB of allocations for List's underlying array (500k * 8 bytes per pointer == 4 MB of memory).
  • Call all of the content producers to collect their strings. Approximately 4 MB of copies, as we copy the pointer to the returned string into List's underlying array.
  • Call List<string>.ToArray() to obtain a string[]. Approximately 4 MB of allocations and copies (again, we're really just copying pointers).
  • Call string.Concat(string[]):
    • Concat will make a copy of the array provided to it before it does any real work. Approximately 4 MB of allocations and copies, again.
    • Concat will then allocate a single 'destination' string object using the internal string.FastAllocateString() special method. Approximately 200 MB of allocations.
    • Concat will then copy strings from its internal copy of the provided array directly into the destination. Approximately 200 MB of copies.

Total cost: approximately 212 MB of allocations and copies


None of these approaches are ideal, however approach #3 is very close. We're assuming that the absolute minimum of memory that needs to be allocated and copied is 200 MB (for the destination string), and here we get pretty close - 212 MB.

If there were a string.Concat overload that 1) Accepted an IList<string> and 2) did not make a copy of that IList before using it, then the problem would be solved. No such method is provided by .Net, hence the subject of this question.

Edit 2


Progress on a solution.

I've done some testing with some hacked IL, and found that directly invoking string.FastAllocateString(n) (which is not usually invokable...) is about as fast as invoking new string('\0', n), and both seem to allocate exactly as much memory as is expected.

From there, it seems its possible to acquire a pointer to the freshly allocated string using the unsafe and fixed statements.

And so, a rough solution begins to appear:

    private static string Concat( List<string> list )
    {
        int concatLength = 0;

        for( int i = 0; i < list.Count; i++ )
        {
            concatLength += list[i].Length;
        }

        string newString = new string( '\0', concatLength );

        unsafe
        {
            fixed( char* ptr = newString )
            {
                ...
            }
        }

        return newString;
    }

The next biggest hurdle is implementing or finding an efficient block copy method, ala Buffer.BlockCopy, except one that will accept char* types.

like image 928
antiduh Avatar asked Aug 26 '15 02:08

antiduh


2 Answers

Unless the average length of the strings is very small, the most efficient approach, given a List<String>, will be to use ToArray() to copy it to a new String[], and pass that to a concatenation or joining method. Doing that may cause a wasted allocation for an array of references if the concatenation or joining method wants to make a copy of its array before it starts, but that would only allocate one reference per string, there will only be one allocation to hold character data, and it will be correctly sized to hold the entire string.

If you're building the data structure yourself, you might gain a little bit of efficiency by initializing a String[] to the estimated required size, populating it yourself, and expanding it as needed. That would save one allocation of a String[] worth of data.

Another approach would be to allocate a String[8192][] and then allocate a String[8192] for each array of strings as you go along. Once you're all done, you'll know exactly what size String[] you need to pass to the Concat method so you can create an array of that exact size. This approach would require a greater quantity of allocations, but only the final String[] and the String itself would need to go on the Large Object Heap.

like image 21
supercat Avatar answered Oct 12 '22 13:10

supercat


If you can determine the length of the concatenation before trying to perform the operation, a char array can beat string builder in some use cases. Manipulating the characters within the array prevents the multiple allocations.

See: http://blogs.msdn.com/b/cisg/archive/2008/09/09/performance-analysis-reveals-char-array-is-better-than-stringbuilder.aspx

UPDATE

Please check out this internal implementation of the String.Join from .NET - it uses unsafe code with pointers to avoid multiple allocations. Unless I'm missing something, it would seem you can re-write this using your List to accomplish what you want:

    [System.Security.SecuritySafeCritical]  // auto-generated 
    public unsafe static String Join(String separator, String[] value, int startIndex, int count) {
        //Range check the array 
        if (value == null) 
            throw new ArgumentNullException("value");

        if (startIndex < 0)
            throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_StartIndex"));
        if (count < 0)
            throw new ArgumentOutOfRangeException("count", Environment.GetResourceString("ArgumentOutOfRange_NegativeCount")); 

        if (startIndex > value.Length - count) 
            throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_IndexCountBuffer")); 
        Contract.EndContractBlock();

        //Treat null as empty string.
        if (separator == null) {
            separator = String.Empty;
        } 

        //If count is 0, that skews a whole bunch of the calculations below, so just special case that. 
        if (count == 0) { 
            return String.Empty;
        } 

        int jointLength = 0;
        //Figure out the total length of the strings in value
        int endIndex = startIndex + count - 1; 
        for (int stringToJoinIndex = startIndex; stringToJoinIndex <= endIndex; stringToJoinIndex++) {
            if (value[stringToJoinIndex] != null) { 
                jointLength += value[stringToJoinIndex].Length; 
            }
        } 

        //Add enough room for the separator.
        jointLength += (count - 1) * separator.Length;

        // Note that we may not catch all overflows with this check (since we could have wrapped around the 4gb range any number of times
        // and landed back in the positive range.) The input array might be modifed from other threads, 
        // so we have to do an overflow check before each append below anyway. Those overflows will get caught down there. 
        if ((jointLength < 0) || ((jointLength + 1) < 0) ) {
            throw new OutOfMemoryException(); 
        }

        //If this is an empty string, just return.
        if (jointLength == 0) { 
            return String.Empty;
        } 

        string jointString = FastAllocateString( jointLength );
        fixed (char * pointerToJointString = &jointString.m_firstChar) { 
            UnSafeCharBuffer charBuffer = new UnSafeCharBuffer( pointerToJointString, jointLength);

            // Append the first string first and then append each following string prefixed by the separator.
            charBuffer.AppendString( value[startIndex] ); 
            for (int stringToJoinIndex = startIndex + 1; stringToJoinIndex <= endIndex; stringToJoinIndex++) {
                charBuffer.AppendString( separator ); 
                charBuffer.AppendString( value[stringToJoinIndex] ); 
            }
            Contract.Assert(*(pointerToJointString + charBuffer.Length) == '\0', "String must be null-terminated!"); 
        }

        return jointString;
    } 

Source: http://www.dotnetframework.org/default.aspx/4@0/4@0/DEVDIV_TFS/Dev10/Releases/RTMRel/ndp/clr/src/BCL/System/String@cs/1305376/String@cs

UPDATE 2

Good point on the fast allocate. According to an old SO post, you can wrap FastAllocate using reflection (assuming of course you'd cache the fastAllocate method reference so you just called Invoke each time. Perhaps the tradeoff of the call is better than what you're doing now.

var fastAllocate = typeof (string).GetMethods(BindingFlags.NonPublic | BindingFlags.Static)
    .First(x => x.Name == "FastAllocateString");
var newString = (string)fastAllocate.Invoke(null, new object[] {20});
Console.WriteLine(newString.Length); // 20

Perhaps another approach is to use unsafe code to copy your allocation into a char* array, then pass this to the string constructor. The string constructor with char* is an extern passed to the underlying C++ implementation. I haven't found a reliable source for that code to confirm, but perhaps this can be faster for you. The non-prod ready code (no checks for potential overflow, add fixed to lock strings from garbage collection, etc) would start with:

    public unsafe string MyConcat(List<string> values)
    {
        int index = 0;
        int totalLength = values.Sum(m => m.Length);
        char* concat = stackalloc char[totalLength + 1]; // Add additional char for null term
        foreach (var value in values)
        {
            foreach (var c in value)
            {
                concat[index] = c;
                index++;
            }
        }
        concat[index] = '\0';
        return new string(concat);
    }

Now I'm all out of ideas for this :) Perhaps somebody can figure out a method here with marshalling to avoid unsafe code. Since introducing unsafe code requires adding the unsafe flag to compilation, consider adding this piece as a separate dll to minimize your app's security risk if you go down that route.

like image 183
Jason W Avatar answered Oct 12 '22 14:10

Jason W