Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to join strings with a prefix, suffix and separator

Tags:

string

c#

.net

UPDATE

Following Mr Cheese's answer, it seems that the

public static string Join<T>(string separator, IEnumerable<T> values)

overload of string.Join gets its advantage from the use of the StringBuilderCache class.

Does anybody have any feedback on the correctness or reason of this statement?

Could I write my own,

public static string Join<T>(
    string separator,
    string prefix,
    string suffix,
    IEnumerable<T> values)

function that uses the StringBuilderCache class?


After submitting my answer to this question I got drawn into some analysis of which would be the best performing answer.

I wrote this code, in a console Program class to test my ideas.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

class Program
{
    static void Main()
    {
        const string delimiter = ",";
        const string prefix = "[";
        const string suffix = "]";
        const int iterations = 1000000;

        var sequence = Enumerable.Range(1, 10).ToList();

        Func<IEnumerable<int>, string, string, string, string>[] joiners =
            {
                Build,
                JoinFormat,
                JoinConcat
            };

        // Warmup
        foreach (var j in joiners)
        {
            Measure(j, sequence, delimiter, prefix, suffix, 5);
        }

        // Check
        foreach (var j in joiners)
        {
            Console.WriteLine(
                "{0} output:\"{1}\"",
                j.Method.Name,
                j(sequence, delimiter, prefix, suffix));
        }

        foreach (var result in joiners.Select(j => new
                {
                    j.Method.Name,
                    Ms = Measure(
                        j,
                        sequence,
                        delimiter,
                        prefix,
                        suffix,
                        iterations)
                }))
        {
            Console.WriteLine("{0} time = {1}ms", result.Name, result.Ms);
        }

        Console.ReadKey();
    }

    private static long Measure<T>(
        Func<IEnumerable<T>, string, string, string, string> func,
        ICollection<T> source,
        string delimiter,
        string prefix,
        string suffix,
        int iterations)
    {
        var stopwatch = new Stopwatch();

        stopwatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            func(source, delimiter, prefix, suffix);
        }

        stopwatch.Stop();

        return stopwatch.ElapsedMilliseconds;
    }

    private static string JoinFormat<T>(
        IEnumerable<T> source,
        string delimiter,
        string prefix,
        string suffix)
    {
        return string.Format(
            "{0}{1}{2}",
            prefix,
            string.Join(delimiter, source),
            suffix);
    }

    private static string JoinConcat<T>(
        IEnumerable<T> source,
        string delimiter,
        string prefix,
        string suffix)
    {
        return string.Concat(
            prefix,
            string.Join(delimiter, source),
            suffix);
    }

    private static string Build<T>(
        IEnumerable<T> source,
        string delimiter,
        string prefix,
        string suffix)
    {
        var builder = new StringBuilder();
        builder = builder.Append(prefix);

        using (var e = source.GetEnumerator())
        {
            if (e.MoveNext())
            {
                builder.Append(e.Current);
            }

            while (e.MoveNext())
            {
                builder.Append(delimiter);
                builder.Append(e.Current);
            }
        }

        builder.Append(suffix);
        return builder.ToString();
    }
}

running the code, in release configuration, built with optimizations, from the command line I get output like this.

...

Build time = 1555ms

JoinFormat time = 1715ms

JoinConcat time = 1452ms

The only suprise here (to me) is that the Join-Format combination is the slowest. After considering this answer, this makes a little more sense, the output of the string.Join is being processed by the outer StringBuilder in string.Format, there is an inherent delay with this approach.

After musing, I don't clearly understand how string.Join can be faster. I've read about its use of FastAllocateString() but I don't understand how the buffer can be accurately pre-allocated without calling .ToString() on every member of sequence. Why is the Join-Concat combination faster?

Once I understand that, would it be possible to write my own unsafe string Join function, that takes the extra prefix and suffix parameters and out performs the "safe" alternatives.

I've had several attempts and whilst they work, they are not faster.

like image 796
Jodrell Avatar asked Nov 16 '12 15:11

Jodrell


People also ask

What is the most efficient way to concatenate many strings together?

If you are concatenating a list of strings, then the preferred way is to use join() as it accepts a list of strings and concatenates them and is most readable in this case. If you are looking for performance, append/join is marginally faster there if you are using extremely long strings.

What is string prefix and suffix?

Prefixes and suffixes are special cases of substrings. A prefix of a string is a substring of that occurs at the beginning of ; likewise, a suffix of a string is a substring that occurs at the end of .

How do you prefix a string?

A prefix of a string S is any leading contiguous part of S. A suffix of the string S is any trailing contiguous part of S. For example, "c" and "cod" are prefixes, and "ty" and "ity" are suffixes of the string "codility".

How do you join strings?

The join() method takes all items in an iterable and joins them into one string. A string must be specified as the separator.


1 Answers

To try and answer your original question, I think the answer lies in (the amazing) Reflector tool. You are using collections of objects that are IEnumerable which then also causes the overload of the same type in String.Join method to be called. Interestingly, this function is remarkably similar to your Build function since it enumerates the collection and uses a string builder which means it doesn't need to know the length of all of the strings in advance.

public static string Join<T>(string separator, IEnumerable<T> values)
{

    if (values == null)
    {
        throw new ArgumentNullException("values");
    }
    if (separator == null)
    {
        separator = Empty;
    }
    using (IEnumerator<T> enumerator = values.GetEnumerator())
    {
        if (!enumerator.MoveNext())
        {
            return Empty;
        }
        StringBuilder sb = StringBuilderCache.Acquire(0x10);
        if (enumerator.Current != null)
        {
            string str = enumerator.Current.ToString();
            if (str != null)
            {
                sb.Append(str);
            }
        }
        while (enumerator.MoveNext())
        {
            sb.Append(separator);
            if (enumerator.Current != null)
            {
                string str2 = enumerator.Current.ToString();
                if (str2 != null)
                {
                    sb.Append(str2);
                }
            }
        }
        return StringBuilderCache.GetStringAndRelease(sb);
    }
}

It seems to be doing something with cached StringBuilders which I don't fully understand but it's probably why it's faster due to some internal optimisation. As I'm working on a laptop I may have been caught out by power management state changes before so I've rerun the code with the 'BuildCheat' method (avoids the string builder buffer capacity doubling) included and the times are remarkably close to String.Join(IEnumerable) (also ran outside of the debugger).

Build time = 1264ms

JoinFormat = 1282ms

JoinConcat = 1108ms

BuildCheat = 1166ms

private static string BuildCheat<T>(
    IEnumerable<T> source,
    string delimiter,
    string prefix,
    string suffix)
{
    var builder = new StringBuilder(32);
    builder = builder.Append(prefix);

    using (var e = source.GetEnumerator())
    {
        if (e.MoveNext())
        {
            builder.Append(e.Current);
        }

        while (e.MoveNext())
        {
            builder.Append(delimiter);
            builder.Append(e.Current);
        }
    }

    builder.Append(suffix);
    return builder.ToString();
}

The answer the final part of your question is where you mention the use of FastAllocateString but as you can see, it's not called above in the overloaded method that passes IEnumerable, it's only called when it's working directly with strings and it most definitely does loop through the array of strings to sum up their lengths prior to creating the final output.

public static unsafe string Join(string separator, string[] value, int startIndex, int count)
{
    if (value == null)
    {
        throw new ArgumentNullException("value");
    }
    if (startIndex < 0)
    {
        throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_StartIndex"));
    }
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count", Environment.GetResourceString("ArgumentOutOfRange_NegativeCount"));
    }
    if (startIndex > (value.Length - count))
    {
        throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_IndexCountBuffer"));
    }
    if (separator == null)
    {
        separator = Empty;
    }
    if (count == 0)
    {
        return Empty;
    }
    int length = 0;
    int num2 = (startIndex + count) - 1;
    for (int i = startIndex; i <= num2; i++)
    {
        if (value[i] != null)
        {
            length += value[i].Length;
        }
    }
    length += (count - 1) * separator.Length;
    if ((length < 0) || ((length + 1) < 0))
    {
        throw new OutOfMemoryException();
    }
    if (length == 0)
    {
        return Empty;
    }
    string str = FastAllocateString(length);
    fixed (char* chRef = &str.m_firstChar)
    {
        UnSafeCharBuffer buffer = new UnSafeCharBuffer(chRef, length);
        buffer.AppendString(value[startIndex]);
        for (int j = startIndex + 1; j <= num2; j++)
        {
            buffer.AppendString(separator);
            buffer.AppendString(value[j]);
        }
    }
    return str;
}

Just out of interest I changed your program to not use generics and made JoinFormat and JoinConcat accept a simple array of strings (I couldn't readily change Build since it uses an enumerator), so String.Join uses the other implementation above. The results are pretty impressive:

JoinFormat time = 386ms

JoinConcat time = 226ms

Perhaps you can find a solution that makes the best of fast string arrays whilst also using generic inputs...

like image 173
Mr Cheese Avatar answered Sep 26 '22 01:09

Mr Cheese