
Regex replacements inside a StringBuilder

I'm writing the contents of a text file to a StringBuilder and I then want to perform a number of find/replace actions on the text contained in the StringBuilder using regular expressions.

I've run into a problem: StringBuilder.Replace does not accept regular expressions.

I could use Regex.Replace on a normal string, but I'm under the impression that this is inefficient because .NET strings are immutable, so two copies of the text will need to exist in memory at once.

Once I've updated the text I plan to write it back to the original file.

What's the best and most efficient way to solve my problem?

EDIT

In addition to the answer(s) below, I've found the following questions that also shed some light on my problem -

  • memory-efficiency-and-performance-of-string-replace-net-framework
  • is-stringbuilder-replace-more-efficient-than-string-replace
  • at-what-point-does-using-a-stringbuilder-become-insignificant-or-an-overhead

ipr101 asked Aug 17 '10 16:08



3 Answers

The best and most efficient solution for your time is to try the simplest approach first: forget the StringBuilder and just use Regex.Replace. Then find out how slow it is - it may very well be good enough. Don't forget to try the regex in both compiled and non-compiled mode.
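
As a rough sketch of that first step (not part of the original answer): the class and method names below are made up for illustration, and the whole file is assumed to fit comfortably in memory. ReplaceInFile is the plain Regex.Replace round trip; Time lets you compare RegexOptions.None against RegexOptions.Compiled on your own data.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class SimpleReplace
{
    // Reads the file, applies one regex replacement, and writes the result back.
    public static void ReplaceInFile(string path, string pattern, string replacement, RegexOptions options)
    {
        string text = File.ReadAllText(path);
        string updated = Regex.Replace(text, pattern, replacement, options);
        File.WriteAllText(path, updated);
    }

    // Runs the same replacement `iterations` times and returns the elapsed time,
    // so RegexOptions.None and RegexOptions.Compiled can be compared.
    public static TimeSpan Time(string text, string pattern, string replacement, RegexOptions options, int iterations)
    {
        var regex = new Regex(pattern, options);
        regex.Replace(text, replacement); // warm-up run, keeps construction cost out of the timing

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.Replace(text, replacement);
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Calling Time once with RegexOptions.None and once with RegexOptions.Compiled on a representative file should tell you whether compiling the regex is worth it for your workload.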

If that isn't fast enough, consider using a StringBuilder for any replacements you can express simply, and then use Regex.Replace for the rest. You might also want to consider trying to combine replacements, reducing the number of regexes (and thus intermediate strings) used.
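
To illustrate what combining might look like, here is a minimal sketch. The three rules (a literal "colour" to "color" replacement, collapsing whitespace runs, masking digit runs) are invented for the example, as are the class and method names. The literal replacement happens in place on the StringBuilder; the two pattern-based rules share one regex, so only one pass and one result string are needed for them.

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical example rules; substitute your own patterns.
public static class CombinedReplace
{
    // Two patterns merged into one regex using named groups.
    private static readonly Regex Combined = new Regex(@"(?<ws>\s+)|(?<num>\d+)");

    public static string Apply(StringBuilder text)
    {
        // Simple literal replacement: done in place by StringBuilder.Replace.
        text.Replace("colour", "color");

        // Everything that needs a pattern: a single regex pass,
        // choosing the replacement by which named group matched.
        return Combined.Replace(text.ToString(), m => m.Groups["ws"].Success ? " " : "#");
    }
}
```

Note that text.ToString() and the regex result are still two full-size strings; the saving is in not creating one more per additional rule.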

Jon Skeet answered Oct 18 '22 20:10


You have 3 options:

  1. Do this in an inefficient way with strings as others have recommended here.

  2. Use the .Matches() call on your Regex object, and emulate the way .Replace() works (see #3).

  3. Adapt the Mono implementation of Regex to build a Regex that accepts StringBuilder. Almost all of the work is already done for you in Mono, but it will take time to suss out the parts that make it work into their own library. Mono's Regex leverages Novell's 2002 JVM implementation of Regex, oddly enough.

Expanding on the above:

2. Emulate Replace()

You can mimic LTRReplace's behavior by calling .Matches(), tracking where you are in the original string, and looping:

MatchCollection matches = regex.Matches(original);
var sb = new StringBuilder(original.Length);
int pos = 0; // position in the original string
foreach (Match match in matches)
{
    // Append the portion of the original we skipped (start index and count, no intermediate Substring)
    sb.Append(original, pos, match.Index - pos);

    // Append whatever should replace the match here: your own custom Replace, or even the result of another Regex

    // Continue right after the end of this match
    pos = match.Index + match.Length;
}
// Append the tail of the original after the last match
sb.Append(original, pos, original.Length - pos);

However, this only saves you some intermediate strings - the Mono approach is the only one that really eliminates strings outright.
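
For reference, the loop can be wrapped into a reusable helper along these lines. This is a sketch under assumptions: the names RegexAppend and ReplaceInto are made up, and the regex is assumed to match left to right (no RegexOptions.RightToLeft). The input is still a string, so it saves the intermediate result strings, not the input copy.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of System.Text.RegularExpressions.
public static class RegexAppend
{
    // Appends `input` to `output`, with every match of `regex`
    // replaced by the string returned from `evaluator`.
    public static void ReplaceInto(Regex regex, string input, Func<Match, string> evaluator, StringBuilder output)
    {
        int pos = 0; // first character of input not yet copied
        for (Match m = regex.Match(input); m.Success; m = m.NextMatch())
        {
            // Copy the unmatched text before this match.
            output.Append(input, pos, m.Index - pos);
            // Copy the replacement instead of the matched text.
            output.Append(evaluator(m));
            pos = m.Index + m.Length;
        }
        // Copy the tail after the last match.
        output.Append(input, pos, input.Length - pos);
    }
}
```

Because the output StringBuilder is passed in, several inputs (for example, lines read from a file) can be appended to the same builder without creating a result string per call.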

3. Mono

This answer has been sitting out since 2014, and I never saw a StringBuilder based Regex land either here in the comments or in searching. So, just to get the ball rolling I extracted the Regex impl from Mono and put it here:

https://github.com/brass9/RegexStringBuilder

I then created an interface IString to allow the inputs and outputs to be more loosely passed - with string, StringBuilder and char[] each wrapped in a class that implements IString.
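
To give an idea of the wrapping approach, here is a hypothetical sketch. It is not the repository's actual IString; the interface name ICharSource, its members, and the wrapper classes are all assumptions made for illustration. It only shows how code can read characters without caring which of the three types is underneath.

```csharp
using System.Text;

// Hypothetical stand-in for the idea behind IString; the real interface may differ.
public interface ICharSource
{
    int Length { get; }
    char this[int index] { get; }
}

public sealed class StringSource : ICharSource
{
    private readonly string _value;
    public StringSource(string value) { _value = value; }
    public int Length { get { return _value.Length; } }
    public char this[int index] { get { return _value[index]; } }
}

public sealed class StringBuilderSource : ICharSource
{
    private readonly StringBuilder _value;
    public StringBuilderSource(StringBuilder value) { _value = value; }
    public int Length { get { return _value.Length; } }
    public char this[int index] { get { return _value[index]; } }
}

public sealed class CharArraySource : ICharSource
{
    private readonly char[] _value;
    public CharArraySource(char[] value) { _value = value; }
    public int Length { get { return _value.Length; } }
    public char this[int index] { get { return _value[index]; } }
}

public static class CharSourceDemo
{
    // Counts occurrences of `c` through the interface only.
    public static int Count(ICharSource source, char c)
    {
        int count = 0;
        for (int i = 0; i < source.Length; i++)
        {
            if (source[i] == c) count++;
        }
        return count;
    }
}
```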

The result is not fast - Microsoft's highly optimized code runs 10,000 simple replaces ~6x faster than this code. But, I've done nothing to optimize it, especially around eliminating strings deeper in the underlying code (it casts to string in some cases to run .ToLower() only to go back to char arrays).

Contributions welcome. A discussion from 2014 of how the code worked in Mono (shortly before it was removed from Mono in favor of Microsoft's string-based implementation) is below:

System.Text.RegularExpressions.Regex uses an RxCompiler to instantiate an IMachineFactory in the form of an RxInterpreterFactory, which unsurprisingly makes IMachines as RxInterpreters. Getting those to emit is most of what you need to do, although if you're just looking to learn how it's all structured for efficiency, note that much of what you're looking for is in its base class, BaseMachine.

In particular, BaseMachine is where the StringBuilder-based code lives. In the method LTRReplace, it first instantiates a StringBuilder with the initial string, and everything from there on out is purely StringBuilder-based. It's actually very annoying that Regex doesn't expose StringBuilder methods, if we assume the internal Microsoft .NET implementation is similar.

Chris Moschini answered Oct 18 '22 21:10


I'm not sure if this helps your scenario or not, but I ran into some memory consumption ceilings with Regex and I needed a simple wildcard replacement extension method on a StringBuilder to push past it. If you need complex Regex matching and/or backreferences, this won't do, but if simple * or ? wildcard replacements (with literal "replace" text) would get the job done for you, then the workaround at the end of my question here should at least give you a boost:

Has anyone implemented a Regex and/or Xml parser around StringBuilders or Streams?

Paul Smith answered Oct 18 '22 22:10