I have a method that uses Regex to find a pattern within a text string. It works, but isn't adequate going forward because it requires the words to appear in the exact order, rather than treating the phrase as a set of words.
    public static string HighlightExceptV1(this string text, string wordsToExclude)
    {
        // Original version
        // wordsToExclude usually consists of a 1, 2 or 3 word term.
        // The text must be in a specific order to work.
        var pattern = $@"(\s*\b{wordsToExclude}\b\s*)";
        // Do something to string...
    }
This version improves on the previous one in that it allows the words to be matched in any order, but it causes some spacing issues in the final output because the spaces are removed and replaced with pipes.
    public static string HighlightExceptV2(this string text, string wordsToExclude)
    {
        // This version allows the words to be matched in any order, but it has
        // flaws, in that the natural spacing is removed in some cases.
        var words = wordsToExclude.Replace(' ', '|');
        var pattern = $@"(\s*\b{words}\b\s*)";
        // Example phrase: big blue widget
        // Example output: $@"(\s*\bbig|blue|widget\b\s*)"
        // Do something to string...
    }
Ideally, the spacing needs to be preserved around each word. The pseudo example below shows what I'm trying to do: split the phrase into words, wrap each word in its own pattern, then rejoin the word patterns to produce the pattern that will be used to match.
    public static string HighlightExceptV3(this string text, string wordsToExclude)
    {
        // The outputted pattern must be dynamic due to unknown
        // words in phrase.
        // Example phrase: big blue widget
        var words = wordsToExclude.Replace(' ', '|');
        // Example: big|blue|widget
        // The code below isn't complete - merely an example
        // of the required output.
        var wordPattern = $@"\s*\b{word}\b\s*";
        // Example: $@"\s*\bwidget\b\s*"
        var phrasePattern = $"({rejoinedArray})";
        // @"(\s*\bbig\b\s*|\s*\bblue\b\s*|\s*\bwidget\b\s*)";
        // Do something to string...
    }
Note: There may be better ways of dealing with the spacing around word boundaries, but I'm not a regex expert.
I'm looking for some help/advice on taking the split array, wrapping each item, then rejoining it in the neatest way.
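For reference, the split / wrap / rejoin step described above can be sketched with Split, Select and string.Join. This is only a minimal sketch: the class name PatternBuilder and the method name BuildPhrasePattern are made up for illustration, and the Regex.Escape call is an added assumption so that words containing regex metacharacters do not break the pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternBuilder
{
    // Hypothetical helper: splits the phrase on spaces, wraps each word in
    // the \s*\b...\b\s* pattern from the question, and rejoins with '|'.
    public static string BuildPhrasePattern(string wordsToExclude)
    {
        var wordPatterns = wordsToExclude
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Regex.Escape is an assumption added here, not part of the question.
            .Select(word => $@"\s*\b{Regex.Escape(word)}\b\s*");

        return $"({string.Join("|", wordPatterns)})";
    }

    public static void Main()
    {
        // Prints: (\s*\bbig\b\s*|\s*\bblue\b\s*|\s*\bwidget\b\s*)
        Console.WriteLine(BuildPhrasePattern("big blue widget"));
    }
}
```

For the example phrase this yields the same pattern as the target shown in the last comment of HighlightExceptV3.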
You need to enclose all your alternatives within a non-capturing group, (?:...|...). Besides, to counter potential issues, I suggest replacing the word boundaries with their unambiguous lookaround equivalents, (?<!\w)...(?!\w).
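To illustrate both points, here is a small sketch comparing the ungrouped pattern with the grouped one, and \b with the lookarounds. The test strings and the "c++" term are made-up examples, not taken from the question; the class and member names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // Ungrouped: '|' has the lowest precedence, so this reads as
    //   \bbig   OR   blue   OR   widget\b
    public const string Ungrouped = @"\bbig|blue|widget\b";

    // Grouped: the lookarounds now apply to every alternative.
    public const string Grouped = @"(?<!\w)(?:big|blue|widget)(?!\w)";

    // A term that ends in non-word characters (illustrative assumption).
    public const string BoundaryTerm = @"\bc\+\+\b";
    public const string LookaroundTerm = @"(?<!\w)c\+\+(?!\w)";

    public static bool Matches(string pattern, string input) =>
        Regex.IsMatch(input, pattern, RegexOptions.IgnoreCase);

    public static void Main()
    {
        // "blue" inside "bluebird" is matched by the bare middle alternative.
        Console.WriteLine(Matches(Ungrouped, "a bluebird"));        // True
        Console.WriteLine(Matches(Grouped, "a bluebird"));          // False

        // \b needs a word char on one side, so it fails between '+' and ' '.
        Console.WriteLine(Matches(BoundaryTerm, "I like c++ a lot"));   // False
        Console.WriteLine(Matches(LookaroundTerm, "I like c++ a lot")); // True
    }
}
```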
Here is a working C# snippet:
    var text = "there are big widgets in this phrase blue widgets too";
    var words = "big blue widgets";
    var pattern = $@"(\s*(?<!\w)(?:{string.Join("|", words.Split(' ').Select(Regex.Escape))})(?!\w)\s*)";
    var result = string.Concat(Regex.Split(text, pattern, RegexOptions.IgnoreCase).Select((str, index) =>
        index % 2 == 0 && !string.IsNullOrWhiteSpace(str) ? $"<b>{str}</b>" : str));
    Console.WriteLine(result);
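If it helps, the same logic might be packaged as an extension method with the signature used in the question. This is only a sketch under assumptions: the class name HighlightExtensions is made up, and returning the text unchanged when there are no words to exclude is an arbitrary choice that the question does not specify.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HighlightExtensions
{
    public static string HighlightExceptV3(this string text, string wordsToExclude)
    {
        // Assumption: with nothing to exclude, return the text untouched.
        if (string.IsNullOrWhiteSpace(wordsToExclude))
            return text;

        // Split the phrase, escape each word, rejoin with '|'.
        var alternatives = string.Join("|", wordsToExclude
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape));

        // A single capturing group makes Regex.Split return the matches too:
        // even indexes hold the text between matches, odd indexes the matches.
        var pattern = $@"(\s*(?<!\w)(?:{alternatives})(?!\w)\s*)";
        var parts = Regex.Split(text, pattern, RegexOptions.IgnoreCase);

        return string.Concat(parts.Select((part, index) =>
            index % 2 == 0 && !string.IsNullOrWhiteSpace(part) ? $"<b>{part}</b>" : part));
    }
}

public static class Program
{
    public static void Main()
    {
        var text = "there are big widgets in this phrase blue widgets too";
        // Prints: <b>there are</b> big widgets <b>in this phrase</b> blue widgets <b>too</b>
        Console.WriteLine(text.HighlightExceptV3("big blue widgets"));
    }
}
```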
NOTES
- words.Split(' ').Select(Regex.Escape) splits the words text on spaces and regex-escapes each item
- string.Join("|", ...) re-builds the string, inserting | between the items
- (?<!\w) is a negative lookbehind that matches a location that is not immediately preceded by a word char, and (?!\w) is a negative lookahead that matches a location that is not immediately followed by a word char.

I suggest implementing an FSM (Finite State Machine) with 2 states (in and out of selection) and Regex.Replace (we can keep the word as it is, word, or replace it with <b>word, word</b> or <b>word</b>).
    private static string MyModify(string text, string wordsToExclude) {
      HashSet<string> exclude = new HashSet<string>(
        wordsToExclude.Split(' '), StringComparer.OrdinalIgnoreCase);

      bool inSelection = false;

      string result = Regex.Replace(text, @"[\w']+", match => {
        var next = match.NextMatch();

        if (inSelection) {
          if (next.Success && exclude.Contains(next.Value)) {
            inSelection = false;
            return match.Value + "</b>";
          }
          else
            return match.Value;
        }
        else {
          if (exclude.Contains(match.Value))
            return match.Value;
          else if (next.Success && exclude.Contains(next.Value))
            return "<b>" + match.Value + "</b>";
          else {
            inSelection = true;
            return "<b>" + match.Value;
          }
        }
      });

      if (inSelection)
        result += "</b>";

      return result;
    }
Demo:
    string wordsToExclude = "big widgets blue if";

    string[] tests = new string[] {
      "widgets for big blue",
      "big widgets are great but better if blue",
      "blue",
      "great but expensive",
      "big and small, blue and green",
    };

    string report = string.Join(Environment.NewLine, tests
      .Select(test => $"{test,-40} -> {MyModify(test, wordsToExclude)}"));

    Console.Write(report);
Outcome:
    widgets for big blue                     -> widgets <b>for</b> big blue
    big widgets are great but better if blue -> big widgets <b>are great but better</b> if blue
    blue                                     -> blue
    great but expensive                      -> <b>great but expensive</b>
    big and small, blue and green            -> big <b>and small</b>, blue <b>and green</b>