Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to replace multiple strings in a huge string

I'm looking for the fastest way to replace multiple (~500) substrings of a big (~1mb) string. Whatever I have tried it seems that String.Replace is the fastest way of doing it.

I just care about the fastest possible way. Not code readability, maintainability etc. I don't care if I need to use unsafe code or pre-process the original string either.

Each replace iteration will replace ABC on the string with some other string (different per replace iteration). The string to replace will ALWAYS be the same - ABC will always be ABC. Never ABD. So if there are 400.000 thousands replace iterations. The same string - ABC - will be replaced with some other (different) string each time.

I can be in control of what ABC is. I can make it super-short or super-long as long as it doesn't affect the results. Clearly ABC can't be hello cause hello will exist as a word in most of the input strings.

Example input: ABCDABCABCDABCABCDABCABCDABCD

Example replace from string: BC

Example replace with strings: AA, BB, CC, DD, EE (5 iterations)

Example outputs:

AAADAAAAAADAAAAAADAAAAAADAAAD ABBDABBABBDABBABBDABBABBDABBD ACCDACCACCDACCACCDACCACCDACCD ADDDADDADDDADDADDDADDADDDADDD AEEDAEEAEEDAEEAEEDAEEAEEDAEED 

Average case: Input string is 100-200kb with 40.000 replace iterations. Worst case: Input string is 1-2mb with 400.000 replace iterations.

I can do ANYTHING. Do it in parallel, do it unsafe, etc. It doesn't matter how I do it. What matters is that it needs to be as fast as it gets.

like image 790
Yannis Avatar asked Nov 26 '13 15:11

Yannis


People also ask

How do you replace multiple values in a string?

Show activity on this post. var str = "I have a cat, a dog, and a goat."; str = str. replace(/goat/i, "cat"); // now str = "I have a cat, a dog, and a cat." str = str. replace(/dog/i, "goat"); // now str = "I have a cat, a goat, and a cat." str = str.


2 Answers

Using unsafe and compiled as x64

result:

Implementation       | Exec   | GC #1 Simple            | 4706ms |  0ms #2 Simple parallel   | 2265ms |  0ms #3 ParallelSubstring |  800ms | 21ms #4 Fredou unsafe     |  432ms | 15ms 

take the code of Erti-Chris Eelmaa and replace my previous one with this.

I don't think I will do another iteration but i did learn a few thing with unsafe which is a good thing :-)

    private unsafe static void FredouImplementation(string input, int inputLength, string replace, string[] replaceBy)     {         var indexes = new List<int>();          //input = "ABCDABCABCDABCABCDABCABCDABCD";         //inputLength = input.Length;         //replaceBy = new string[] { "AA", "BB", "CC", "DD", "EE" };          //my own string.indexof to save a few ms         int len = inputLength;          fixed (char* i = input, r = replace)         {             int replaceValAsInt = *((int*)r);              while (--len > -1)             {                 if (replaceValAsInt == *((int*)&i[len]))                 {                     indexes.Add(len--);                 }             }                         }          var idx = indexes.ToArray();         len = indexes.Count;          Parallel.For(0, replaceBy.Length, l =>             Process(input, inputLength, replaceBy[l], idx, len)         );     }      private unsafe static void Process(string input, int len, string replaceBy, int[] idx, int idxLen)     {         var output = new char[len];          fixed (char* o = output, i = input, r = replaceBy)         {             int replaceByValAsInt = *((int*)r);              //direct copy, simulate string.copy             while (--len > -1)             {                 o[len] = i[len];             }              while (--idxLen > -1)             {                 ((int*)&o[idx[idxLen]])[0] = replaceByValAsInt;             }         }          //Console.WriteLine(output);     } 
like image 191
Fredou Avatar answered Oct 16 '22 05:10

Fredou


I had a similar issue on a project and I've implemented a Regex solution to perform multiple and case insensitive replacements on a file.

For efficiency purposes, I set criteria to go through the original string only once.

I've published a simple console app to test some strategies on https://github.com/nmcc/Spikes/tree/master/StringMultipleReplacements

The code for the Regex solution is similar to this:

Dictionary<string, string> replacements = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);     // Fill the dictionary with the proper replacements:          StringBuilder patternBuilder = new StringBuilder();                 patternBuilder.Append('(');                  bool firstReplacement = true;                  foreach (var replacement in replacements.Keys)                 {                     if (!firstReplacement)                         patternBuilder.Append('|');                     else                         firstReplacement = false;                      patternBuilder.Append('(');                     patternBuilder.Append(Regex.Escape(replacement));                     patternBuilder.Append(')');                 }                 patternBuilder.Append(')');                  var regex = new Regex(patternBuilder.ToString(), RegexOptions.IgnoreCase);                  return regex.Replace(sourceContent, new MatchEvaluator(match => replacements[match.Groups[1].Value])); 

EDIT: The execution times running the test application on my computer are:

  • Looping through the replacements calling string.Substring() (CASE SENSITIVE): 2ms
  • Single pass using Regex with multiple replacements at once (Case insensitive): 8ms
  • Looping through replacements using a ReplaceIgnoreCase Extension (Case insensitive): 55ms
like image 36
nunoc Avatar answered Oct 16 '22 04:10

nunoc