How do I split a string by strings and include the delimiters using .NET?

Tags: string, c#, .net

There are many similar questions, but apparently no perfect match, which is why I'm asking.

I'd like to split an arbitrary string (e.g. 123xx456yy789) by a list of string delimiters (e.g. xx, yy) and include the delimiters in the result (here: 123, xx, 456, yy, 789).

Good performance is a nice bonus. Regex should be avoided, if possible.

Update: I did some performance checks and compared the results (too lazy to formally check them though). The tested solutions are (in random order):

Gabe
Guffa
Mafu
Regex

Other solutions were not tested because either they were similar to another solution or they came in too late.

This is the test code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// The SplitIncludeDelimiters_* extension methods are the tested solutions;
// they are defined elsewhere and not shown here.
class Program
{
    private static readonly List<Func<string, List<string>, List<string>>> Functions;
    private static readonly List<string> Sources;
    private static readonly List<List<string>> Delimiters;

    static Program ()
    {
        // Function IDs: 0 = Gabe, 1 = Guffa, 2 = naive (Mafu), 3 = Regex
        Functions = new List<Func<string, List<string>, List<string>>> ();
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Gabe (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Guffa (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Naive (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Regex (l).ToList ());

        // Source IDs: 0 = empty string, 1 = GUID, 2 = long generated string
        Sources = new List<string> ();
        Sources.Add ("");
        Sources.Add (Guid.NewGuid ().ToString ());

        string str = "";
        for (int outer = 0; outer < 10; outer++) {
            for (int i = 0; i < 10; i++) {
                str += i + "**" + DateTime.UtcNow.Ticks;
            }
            str += "-";
        }
        Sources.Add (str);

        // Delimiter IDs: 0 = none, 1 = "-", 2 = "**", 3 = both
        Delimiters = new List<List<string>> ();
        Delimiters.Add (new List<string> () { });
        Delimiters.Add (new List<string> () { "-" });
        Delimiters.Add (new List<string> () { "**" });
        Delimiters.Add (new List<string> () { "-", "**" });
    }

    private class Result
    {
        public readonly int FuncID;
        public readonly int SrcID;
        public readonly int DelimID;
        public readonly long Milliseconds;
        public readonly List<string> Output;

        public Result (int funcID, int srcID, int delimID, long milliseconds, List<string> output)
        {
            FuncID = funcID;
            SrcID = srcID;
            DelimID = delimID;
            Milliseconds = milliseconds;
            Output = output;
        }

        // Prints the timing line, then the item count and the first ten items
        // (each truncated to 15 characters).
        public void Print ()
        {
            Console.WriteLine ("S " + SrcID + "\tD " + DelimID + "\tF " + FuncID + "\t" + Milliseconds + "ms");
            Console.WriteLine (Output.Count + "\t" + string.Join (" ", Output.Take (10).Select (x => x.Length < 15 ? x : x.Substring (0, 15) + "...").ToArray ()));
        }
    }

    static void Main (string[] args)
    {
        var results = new List<Result> ();

        for (int srcID = 0; srcID < 3; srcID++) {
            for (int delimID = 0; delimID < 4; delimID++) {
                for (int funcId = 3; funcId >= 0; funcId--) { // i tried various orders in my tests
                    Stopwatch sw = new Stopwatch ();
                    sw.Start ();

                    var func = Functions[funcId];
                    var src = Sources[srcID];
                    var del = Delimiters[delimID];

                    for (int i = 0; i < 10000; i++) {
                        func (src, del);
                    }
                    var list = func (src, del);
                    sw.Stop ();

                    var res = new Result (funcId, srcID, delimID, sw.ElapsedMilliseconds, list);
                    results.Add (res);
                    res.Print ();
                }
            }
        }
    }
}

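As you can see, it was really just a quick and dirty test, but I ran it multiple times and in different orders, and the results were always very consistent. The measured times range from milliseconds up to seconds for the larger datasets. I ignored the values in the low-millisecond range in my following evaluation because they seemed negligible in practice. In the output, S, D and F are the indices of the source string, the delimiter list and the function (F 0 = Gabe, F 1 = Guffa, F 2 = Mafu/naive, F 3 = Regex); each timing is followed by the number of returned items and the first ten of them. Here's the output on my box: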

S 0     D 0     F 3     11ms
1
S 0     D 0     F 2     7ms
1
S 0     D 0     F 1     6ms
1
S 0     D 0     F 0     4ms
0
S 0     D 1     F 3     28ms
1
S 0     D 1     F 2     8ms
1
S 0     D 1     F 1     7ms
1
S 0     D 1     F 0     3ms
0
S 0     D 2     F 3     30ms
1
S 0     D 2     F 2     8ms
1
S 0     D 2     F 1     6ms
1
S 0     D 2     F 0     3ms
0
S 0     D 3     F 3     30ms
1
S 0     D 3     F 2     10ms
1
S 0     D 3     F 1     8ms
1
S 0     D 3     F 0     3ms
0
S 1     D 0     F 3     9ms
1       9e5282ec-e2a2-4...
S 1     D 0     F 2     6ms
1       9e5282ec-e2a2-4...
S 1     D 0     F 1     5ms
1       9e5282ec-e2a2-4...
S 1     D 0     F 0     5ms
1       9e5282ec-e2a2-4...
S 1     D 1     F 3     63ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 1     F 2     37ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 1     F 1     29ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 1     F 0     22ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 2     F 3     30ms
1       9e5282ec-e2a2-4...
S 1     D 2     F 2     10ms
1       9e5282ec-e2a2-4...
S 1     D 2     F 1     10ms
1       9e5282ec-e2a2-4...
S 1     D 2     F 0     12ms
1       9e5282ec-e2a2-4...
S 1     D 3     F 3     73ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 3     F 2     40ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 3     F 1     33ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 3     F 0     30ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 2     D 0     F 3     10ms
1       0**634226552821...
S 2     D 0     F 2     109ms
1       0**634226552821...
S 2     D 0     F 1     5ms
1       0**634226552821...
S 2     D 0     F 0     127ms
1       0**634226552821...
S 2     D 1     F 3     184ms
21      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2     D 1     F 2     364ms
21      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2     D 1     F 1     134ms
21      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2     D 1     F 0     517ms
20      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2     D 2     F 3     688ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 2     F 2     2404ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 2     F 1     874ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 2     F 0     717ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 3     F 3     1205ms
221     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 3     F 2     3471ms
221     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 3     F 1     1008ms
221     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 3     F 0     1095ms
220     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **

I compared the results and this is what I found:

All 4 functions are fast enough for common usage.
The naive version (aka what I wrote initially) is the worst in terms of computation time.
Regex is a bit slow on small datasets (probably due to initialization overhead).
Regex does well on large data and hits a similar speed as the non-regex solutions.
Guffa's version seems to perform best overall, which is to be expected from the code.
Gabe's version sometimes omits an item, but I did not investigate this (bug?).

To conclude this topic, I suggest using Regex, which is reasonably fast. If performance is critical, I'd prefer Guffa's implementation.
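For readers who want to see the general shape of a regex-free approach, here is a minimal sketch. It is not Gabe's, Guffa's or my own tested code (none of those are reproduced here) and it is not tuned for speed; the class and method names are made up for illustration. It assumes ordinal comparison, picks the earliest delimiter occurrence at each step (ties go to the delimiter listed first), and keeps empty pieces, which matches the item counts the Regex variant produced above.

```csharp
using System;
using System.Collections.Generic;

static class ScanSplitDemo
{
    // Hypothetical helper: splits source at every delimiter occurrence and
    // keeps the delimiters as separate items in the result.
    public static List<string> SplitIncludeDelimiters(string source, IList<string> delimiters)
    {
        var result = new List<string>();
        int start = 0;
        while (true)
        {
            int bestIndex = -1;
            string bestDelimiter = "";
            foreach (string d in delimiters)
            {
                if (string.IsNullOrEmpty(d)) continue; // an empty delimiter would never advance
                int index = source.IndexOf(d, start, StringComparison.Ordinal);
                if (index >= 0 && (bestIndex < 0 || index < bestIndex))
                {
                    bestIndex = index;
                    bestDelimiter = d;
                }
            }
            if (bestIndex < 0)
            {
                // No further delimiter: the rest of the string is the last piece.
                result.Add(source.Substring(start));
                return result;
            }
            result.Add(source.Substring(start, bestIndex - start));
            result.Add(bestDelimiter);
            start = bestIndex + bestDelimiter.Length;
        }
    }

    static void Main()
    {
        var parts = SplitIncludeDelimiters("123xx456yy789", new List<string> { "xx", "yy" });
        Console.WriteLine(string.Join(", ", parts)); // 123, xx, 456, yy, 789
    }
}
```

Because every step rescans the remainder once per delimiter, this sketch can get slow with many delimiters or very long inputs; it is meant to show the idea, not to compete in the benchmark.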

886

asked Mar 20 '10 21:03

mafu

1 Answer

Despite your reluctance to use regex, it preserves the delimiters nicely if you wrap the pattern in a capturing group and call the Regex.Split method:

string input = "123xx456yy789";
string pattern = "(xx|yy)";
string[] result = Regex.Split(input, pattern);
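To see what the capturing group contributes, here is a small self-contained sketch that runs the same input through both forms of the pattern. The class and method names are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureGroupDemo
{
    // Delimiters wrapped in a capturing group: Regex.Split keeps them.
    public static string[] SplitKeep(string input)
    {
        return Regex.Split(input, "(xx|yy)");
    }

    // Same alternation without the group: the delimiters are dropped.
    public static string[] SplitDrop(string input)
    {
        return Regex.Split(input, "xx|yy");
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", SplitKeep("123xx456yy789"))); // 123, xx, 456, yy, 789
        Console.WriteLine(string.Join(", ", SplitDrop("123xx456yy789"))); // 123, 456, 789
    }
}
```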

If you remove the parentheses from the pattern, using just "xx|yy", the delimiters are not preserved. Be sure to call Regex.Escape on any delimiter containing metacharacters that hold special meaning in regex. These characters include \, *, +, ?, |, {, [, (, ), ^, $, ., and #. For instance, a delimiter of . should be escaped as \.. Given a list of delimiters, you need to "OR" them using the pipe | symbol, and that too is a character that gets escaped, so escape each delimiter individually before joining them. To properly build the pattern, use the following code (thanks to @gabe for pointing this out):

var delimiters = new List<string> { ".", "xx", "yy" };
string pattern = "(" + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                  + ")";

The parentheses are concatenated afterwards rather than passed through Regex.Escape, since escaping them would turn the capturing group into literal parentheses, which is not what you want here.
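The difference between escaping each delimiter and escaping the finished pattern can be sketched as follows. BuildPattern is a hypothetical helper name used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Hypothetical helper: escapes each delimiter, joins them with |,
    // then wraps the alternation in an unescaped capturing group.
    public static string BuildPattern(IEnumerable<string> delimiters)
    {
        return "(" + string.Join("|", delimiters.Select(d => Regex.Escape(d)).ToArray()) + ")";
    }

    static void Main()
    {
        var delimiters = new List<string> { ".", "xx" };
        string pattern = BuildPattern(delimiters);

        Console.WriteLine(pattern);                                        // (\.|xx)
        Console.WriteLine(string.Join(", ", Regex.Split("1.2xx3", pattern))); // 1, ., 2, xx, 3

        // Escaping the whole pattern also escapes the group and the pipe,
        // so it would only match the literal text "(.|xx)".
        Console.WriteLine(Regex.Escape("(.|xx)"));                         // \(\.\|xx\)
    }
}
```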

EDIT: In addition, if the delimiters list happens to be empty, the final pattern would incorrectly be () and this would cause blank matches. To prevent this, check whether the list contains any delimiters first. With all this in mind the snippet becomes:

string input = "123xx456yy789";
// to reach the else branch set delimiters to new List<string>();
var delimiters = new List<string> { ".", "xx", "yy", "()" };

if (delimiters.Count > 0)
{
    string pattern = "("
                     + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                     + ")";
    string[] result = Regex.Split(input, pattern);
    foreach (string s in result)
    {
        Console.WriteLine(s);
    }
}
else
{
    // nothing to split
    Console.WriteLine(input);
}

If you need a case-insensitive match for the delimiters use the RegexOptions.IgnoreCase option: Regex.Split(input, pattern, RegexOptions.IgnoreCase)
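As a quick illustration of the case-insensitive variant (the class and method names are made up for this sketch), note that the returned delimiters keep the casing found in the input:

```csharp
using System;
using System.Text.RegularExpressions;

static class IgnoreCaseDemo
{
    // Splits on xx or yy regardless of case, keeping the delimiters.
    public static string[] Split(string input)
    {
        return Regex.Split(input, "(xx|yy)", RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Split("123XX456yy789"))); // 123, XX, 456, yy, 789
    }
}
```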

EDIT #2: the solution so far matches split tokens that might be a substring of a larger string. If the split token should be matched completely, rather than part of a substring, such as a scenario where words in a sentence are used as the delimiters, then the word-boundary \b metacharacter should be added around the pattern.

For example, consider this sentence (yea, it's corny): "Welcome to stackoverflow... where the stack never overflows!"

If the delimiters were { "stack", "flow" } the current solution would split the word "stackoverflow" into 3 strings { "stack", "over", "flow" }. If you needed an exact match, then the only place this would split would be at the word "stack" later in the sentence and not "stackoverflow".

To achieve an exact match behavior alter the pattern to include \b as in \b(delim1|delim2|delimN)\b:

string pattern = @"\b("
                + String.Join("|", delimiters.Select(d => Regex.Escape(d)))
                + @")\b";
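Applied to the sample sentence, the two patterns behave as sketched below. The class and method names are made up for illustration; the sketch counts the resulting pieces rather than printing them all.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WordBoundaryDemo
{
    // Builds the alternation, optionally wrapped in \b word boundaries.
    public static string[] Split(string input, IEnumerable<string> delimiters, bool wholeWords)
    {
        string alternation = string.Join("|", delimiters.Select(d => Regex.Escape(d)).ToArray());
        string pattern = wholeWords ? @"\b(" + alternation + @")\b" : "(" + alternation + ")";
        return Regex.Split(input, pattern);
    }

    static void Main()
    {
        string input = "Welcome to stackoverflow... where the stack never overflows!";
        var delimiters = new List<string> { "stack", "flow" };

        // Substring matching splits inside "stackoverflow" and "overflows" as well.
        Console.WriteLine(Split(input, delimiters, false).Length); // 9

        // With \b only the standalone word "stack" is a split point.
        foreach (string part in Split(input, delimiters, true))
        {
            Console.WriteLine("[" + part + "]");
        }
    }
}
```

The second call yields three pieces: the text before the standalone "stack", the word "stack" itself, and the text after it.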

Finally, if trimming the spaces before and after the delimiters is desired, add \s* around the pattern as in \s*(delim1|delim2|delimN)\s*. This can be combined with \b as follows:

string pattern = @"\s*\b("
                + String.Join("|", delimiters.Select(d => Regex.Escape(d)))
                + @")\b\s*";
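Here is a short sketch of the trimming variant. The input sentence, the delimiters and the class name are invented for illustration. Because \s* sits outside the capturing group, the surrounding whitespace is consumed by the match but not returned.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TrimDemo
{
    // Splits on whole-word delimiters and swallows the whitespace around them.
    public static string[] Split(string input, IEnumerable<string> delimiters)
    {
        string pattern = @"\s*\b("
                         + string.Join("|", delimiters.Select(d => Regex.Escape(d)).ToArray())
                         + @")\b\s*";
        return Regex.Split(input, pattern);
    }

    static void Main()
    {
        var parts = Split("alpha and beta or gamma", new List<string> { "and", "or" });
        Console.WriteLine(string.Join("|", parts)); // alpha|and|beta|or|gamma
    }
}
```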

190

answered Oct 05 '22 12:10

Ahmad Mageed
