Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reason for 11 overloads of string.Concat

Tags:

string

c#

I just noticed that there are 11 overloads of the method string.Concat()

public static string Concat(IEnumerable<string> values);
public static string Concat<T>(IEnumerable<T> values);
public static string Concat(object arg0);
public static string Concat(params object[] args);
public static string Concat(params string[] values);
public static string Concat(object arg0, object arg1);
public static string Concat(string str0, string str1);
public static string Concat(object arg0, object arg1, object arg2);
public static string Concat(string str0, string str1, string str2);
public static string Concat(object arg0, object arg1, object arg2, object arg3);
public static string Concat(string str0, string str1, string str2, string str3);

What is the reason for that? The both

public static string Concat(params object[] args);
public static string Concat<T>(IEnumerable<T> values);

should be the only needed because they are the same convenient/powerful. MSDN doesn't give a answer on that and if you remove the 9 "duplicate" overloads from the framework, noone would notice.

like image 643
Toshi Avatar asked Dec 02 '22 14:12

Toshi


2 Answers

The primary motivator of this implementation decision is performance.

As you correctly note, there could be only two:

public static string Concat(params object[] args);
public static string Concat<T>(IEnumerable<T> values);

And if C# implemented the "params enumerable" feature -- whereby a variadic method could have an IEnumerable<T> instead of a T[] as the expanded parameter -- then we could get down to only one. Or, we could lose the enumerable overload and just use the object array version.

Suppose we did the latter. You say

string x = Foo();
string y = Bar();
string z = x + y;

and what happens? In a world with only variadic ToString this could only be codegen'd as

string x = Foo();
string y = Bar();
object[] array = new string[2];
array[0] = x;
array[1] = y;
string z = string.Concat(array);

So: let's review. Presumably the calls allocate one string each. Then we allocate a short-lived array, copy references over to it, pass that to the variadic method, etc. That method needs to be written to handle an array of any size, to handle a null array, and so on.

We've not only added new short-lived garbage to the gen zero heap; we've also created two new edges in the liveness analysis graph that might have to be traversed. We may have either decreased the amount of time between collections by adding pressure, or increased the cost of collections by adding edges, or, most likely, collections become both more frequent and more expensive: a double-whammy.

But wait, there's more. We must consider what the implementation of the called Concat method looks like.

The object array is -- surprise -- an array of objects, not an array of strings. So what do we need to do? The callee needs to convert each to a string. By calling ToString on each? No, that might crash. By checking for null first, and then calling ToString.

We've passed in strings, but the callee doesn't know that. ToString is an identity for strings, but the compiler doesn't know that, and the call is virtualized so the jitter can't easily optimize it away either. So we've taken on another unnecessary few nanoseconds of checks and indirections. Not to mention that we needed to check the array for null, get the length of the array, make a loop to loop over each element of the array, and so on.

These expenses are very small but they are per concatenation, and they could add up to real time spent and memory pressure.

A great many programs have their performance gated upon string manipulations and memory pressure. How can we eliminate or mitigate these costs?

We could observe that most string concatenations are two strings, so it makes sense to create an overload specifically to handle that situation:

static string Concat(string, string)

Now we can codegen the fragment above as:

string x = Foo();
string y = Bar();
string z = string.Concat(x, y);

Now there is no array created, so there is no extra garbage created, no collection pressure, no new edges in the reference graph. In the callee, the strings need to be checked for nullity but we don't need to call ToString in the implementation because we have the type system to enforce that the operands are already strings, we don't need to check an array for nullity, we don't need a loop variable being checked against the array length, and so on.

Thus we have a good justification for having two overloads: one that takes a params array, and one that takes exactly two strings.

And now we repeat that analysis for another scenario that is common and could be more performant. Each additional overload is designed to produce a more efficient alternative for a common scenario. As more common scenarios are identified that can be made faster and less resource intensive, there is incentive to produce more overloads, and to fix compilers so that they generate code that takes advantage of these overloads. The net result is about a dozen seemingly redundant overloads, each of which can be tuned for high performance; these cover the cases most commonly seen in real programs.

If this subject interests you, I've written a short series of articles on how I redesigned the string concatenation optimizer back in 2006.

https://ericlippert.com/2013/06/17/string-concatenation-behind-the-scenes-part-one/

like image 109
Eric Lippert Avatar answered Dec 26 '22 10:12

Eric Lippert


The (IEnumerable<String>) and (IEnumerable<T>) overloads are not equivalent.

  • IEnumerable<String> means that any string sequence/list can be used directly without needing to incur any runtime cast or string conversion expense, and by using this overload the caller can be assured of this.
  • IEnumerable<T> - would also be the overload for Object[] or any sequence of varying types).

While IEnumerable<T> is more general, unfortunately C# support for variadic arguments using params means it has to be typed as an array (e.g. String[] or Object[]), so that has to be added as a third and forth overload respectively.

It's true that the params Object[] and params String[] versions can correctly substitute for the Object arg0, Object arg1 and String arg0, String arg1 overloads, but the use of params means a new array will be allocated at runtime, which is suboptimal, especially in a tight-loop situation where you want to minimize allocations; so as an optimization if you only have 1, 2, 3 or 4 arguments (which is probably 95% of the time) then those parameters can be passed on the stack.

Coincidentally, I asked a similar question (about params arguments) a few weeks ago: Will `params` in C# always cause a new array to be allocated on every call? - which also discusses why there's a large number of overloads.

What would be nice, is if the C# compiler supported params IEnumerable (given that IEnumerable already has privileged status in C#), and if the JIT would support stack-based variadic parameters instead of using an array on the heap.

like image 27
Dai Avatar answered Dec 26 '22 09:12

Dai