I need to mark up a string with identifiers indicating the start and end of a substring that has passed a test.
Assume I have the string "The quick brown fox jumps over the lazy dog" and I want to mark up the string with a tag around every word starting with the character 'b' or 'o'. The final string would look like "The quick <tag>brown</tag> fox jumps <tag>over</tag> the lazy dog".
Using a combination of regular expressions and LINQ, I have the correct logic to accomplish what I want, but performance is poor because I am using String.Insert to insert the tags. Our strings can be very long (>200k) and the number of substrings to tag can be close to a hundred. Below is the code I am using to insert the tags. Given that I know the start and length of each substring, how can I update the string 'input' faster?
.ForEach<Match>(m => {
    input = input.Insert(m.Index + m.Length, "</tag>");
    input = input.Insert(m.Index, "<tag>");
});
You should use a StringBuilder.
For optimal performance, set the StringBuilder's capacity before doing anything, then append chunks of the original string between tags.
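The single-pass StringBuilder approach described above could look roughly like the following. This is only a sketch: the class name Tagger and the method TagMatches are hypothetical, and the pattern `\b[bo]\w*` used in the usage note is an assumption standing in for whatever test the real code applies.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class Tagger
{
    // Hypothetical helper: wraps every regex match in <tag>...</tag> in one pass.
    public static string TagMatches(string input, Regex pattern)
    {
        MatchCollection matches = pattern.Matches(input);

        // Reserve room up front: the original text plus "<tag>" and "</tag>"
        // (11 characters in total) for each match.
        var sb = new StringBuilder(input.Length + matches.Count * 11);

        int last = 0; // index of the first character not yet copied
        foreach (Match m in matches)
        {
            sb.Append(input, last, m.Index - last); // untouched text before the match
            sb.Append("<tag>");
            sb.Append(input, m.Index, m.Length);    // the matched substring itself
            sb.Append("</tag>");
            last = m.Index + m.Length;
        }

        sb.Append(input, last, input.Length - last); // text after the last match
        return sb.ToString();
    }
}
```

Calling Tagger.TagMatches(input, new Regex(@"\b[bo]\w*")) on the example sentence should give "The quick <tag>brown</tag> fox jumps <tag>over</tag> the lazy dog". Each character is copied once, whereas every String.Insert call allocates and copies a whole new string.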
Alternatively, move your logic into a MatchEvaluator lambda expression and call Regex.Replace.
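A minimal sketch of the MatchEvaluator variant, assuming the test can be expressed as a single pattern (here `\b[bo]\w*`, which is my assumption, as are the names EvaluatorTagger and TagWords):

```csharp
using System.Text.RegularExpressions;

public static class EvaluatorTagger
{
    // Hypothetical helper: the lambda is converted to a MatchEvaluator delegate
    // and is called once per match; its return value replaces that match.
    public static string TagWords(string input)
    {
        return Regex.Replace(input, @"\b[bo]\w*", m => "<tag>" + m.Value + "</tag>");
    }
}
```

Any extra per-match logic (for example, skipping matches that fail a further check by returning m.Value unchanged) can go inside the lambda.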
Try this:
Regex.Replace("The quick brown fox jumps over the lazy dog", @"(^|\s)([bo]\w*)", "$1<tag>$2</tag>");
The quick <tag>brown</tag> fox jumps <tag>over</tag> the lazy dog
Regular expressions should give you a fairly quick replacement. Whether this method is the best depends on the length of the string and on how much work is involved in actually matching one of your "words."