Split string into words and rejoin with additional data

I have a method that uses a regex to find a pattern within a text string. It works, but it isn't adequate going forward because it requires the words to appear in an exact order, rather than treating the phrase as a set of words.

    public static string HighlightExceptV1(this string text, string wordsToExclude)
    {
        // Original version
        // wordsToExclude usually consists of a 1, 2 or 3 word term.
        // The text must be in a specific order to work.

        var pattern = $@"(\s*\b{wordsToExclude}\b\s*)";

        // Do something to string...
    }

This version improves on the previous one in that it allows the words to be matched in any order, but it causes spacing issues in the final output because the spaces are removed and replaced with pipes.

    public static string HighlightExceptV2(this string text, string wordsToExclude)
    {
        // This version allows the words to be matched in any order, but it has
        // flaws, in that the natural spacing is removed in some cases.
        var words = wordsToExclude.Replace(' ', '|');

        var pattern = $@"(\s*\b{words}\b\s*)";

        // Example phrase: big blue widget
        // Example output: $@"(\s*\bbig|blue|widget\b\s*)"

        // Do something to string...
    }
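
A minimal sketch of where the V2 spacing problem may come from, assuming the pattern is used as shown above. In a regex, `|` binds more loosely than everything around it, so `\s*\bbig|blue|widget\b\s*` reads as three separate alternatives: `\s*\bbig`, `blue`, and `widget\b\s*`. Only the first word picks up leading whitespace and only the last picks up trailing whitespace. The class and method names (`AlternationDemo`, `QuotedMatches`) are hypothetical helpers for illustration only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Returns every match of the pattern, wrapped in [ ] so that any
    // whitespace captured around a word is visible.
    public static string[] QuotedMatches(string text, string pattern) =>
        Regex.Matches(text, pattern)
             .Cast<Match>()
             .Select(m => $"[{m.Value}]")
             .ToArray();
}

// Ungrouped (the V2 pattern):
//   AlternationDemo.QuotedMatches("the big blue widget here", @"(\s*\bbig|blue|widget\b\s*)")
//   yields "[ big]", "[blue]", "[widget ]"
//
// Grouped alternation:
//   AlternationDemo.QuotedMatches("the big blue widget here", @"(\s*\b(?:big|blue|widget)\b\s*)")
//   yields "[ big ]", "[blue ]", "[widget ]"
```

With the alternatives grouped, the `\s*` and `\b` parts apply to every word, so the whitespace around each match is captured consistently. The bare `blue` alternative in the ungrouped pattern also has no word boundary at all, so it would match inside a longer word such as "bluebird".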

Ideally, the spacing needs to be preserved around each word. The pseudo example below shows what I'm trying to do.

  1. split the original phrase into words
  2. wrap each word within a regex pattern that will preserve the space when matched
  3. rejoin the word patterns to produce the pattern that will be used to match

    public static string HighlightExceptV3(this string text, string wordsToExclude)
    {
        // The outputted pattern must be dynamic due to unknown
        // words in phrase.
    
        // Example phrase: big blue widget
    
        var words = wordsToExclude.Replace(' ', '|');
        // Example: big|blue|widget
    
        // The code below isn't complete - merely an example
        // of the required output.
    
        var wordPattern = $@"\s*\b{word}\b\s*";
        // Example: $@"\s*\bwidget\b\s*"
    
        var phrasePattern = $"({rejoinedArray})";
        // @"(\s*\bbig\b\s*|\s*\bblue\b\s*|\s*\bwidget\b\s*)";
    
        // Do something to string...
    }
    

Note: There may be better ways of dealing with the word boundaries and spacing, but I'm not a regex expert.

I'm looking for help/advice on how to take the split array, wrap each item, then rejoin it in the neatest way.

John Ohara asked May 08 '19 09:05

2 Answers

You need to enclose all your alternatives within a non-capturing group, (?:...|...). In addition, to avoid potential issues with word boundaries, I suggest replacing them with their unambiguous lookaround equivalents, (?<!\w)...(?!\w).

Here is a working C# snippet:

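// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
var text = "there are big widgets in this phrase blue widgets too";
var words = "big blue widgets";
// Escape each word, join with |, and wrap the alternation in a non-capturing group.
var pattern = $@"(\s*(?<!\w)(?:{string.Join("|", words.Split(' ').Select(Regex.Escape))})(?!\w)\s*)";
// Regex.Split keeps the captured matches at odd indices, so only the
// even-indexed (non-matching) pieces are wrapped in <b>...</b>.
var result = string.Concat(Regex.Split(text, pattern, RegexOptions.IgnoreCase).Select((str, index) =>
            index % 2 == 0 && !string.IsNullOrWhiteSpace(str) ? $"<b>{str}</b>" : str));
Console.WriteLine(result);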

NOTES

  • words.Split(' ').Select(Regex.Escape) - splits the words string on spaces and regex-escapes each item
  • string.Join("|",...) - rebuilds the string, inserting | between the items
  • (?<!\w) - a negative lookbehind that matches a location not immediately preceded by a word char; (?!\w) - a negative lookahead that matches a location not immediately followed by a word char
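
The snippet above could be folded back into the extension-method shape used in the question. This is a minimal sketch under some assumptions: the class name `HighlightExtensions` is hypothetical, the method name `HighlightExceptV3` is borrowed from the question's pseudocode, and returning the text unchanged when no words are supplied is an arbitrary choice rather than anything the question specifies.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HighlightExtensions
{
    // Wraps everything EXCEPT the listed words in <b>...</b>,
    // matching the words in any order and ignoring case.
    public static string HighlightExceptV3(this string text, string wordsToExclude)
    {
        // Drop empty entries so repeated spaces do not create an empty
        // alternative, which would match at every position.
        var words = wordsToExclude
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Regex.Escape(w))
            .ToArray();

        // Assumption: with nothing to exclude, leave the text untouched.
        if (words.Length == 0)
            return text;

        var pattern = $@"(\s*(?<!\w)(?:{string.Join("|", words)})(?!\w)\s*)";

        // Captured matches land at odd indices; bold only the even ones.
        var parts = Regex.Split(text, pattern, RegexOptions.IgnoreCase);
        return string.Concat(parts.Select((part, index) =>
            index % 2 == 0 && !string.IsNullOrWhiteSpace(part) ? $"<b>{part}</b>" : part));
    }
}
```

Calling `"there are big widgets in this phrase blue widgets too".HighlightExceptV3("big blue widgets")` should give `<b>there are</b> big widgets <b>in this phrase</b> blue widgets <b>too</b>`, with the original spacing around each excluded word preserved.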
Wiktor Stribiżew answered Sep 28 '22 18:09


I suggest implementing an FSM (finite state machine) with two states (inside and outside a selection) together with Regex.Replace. Each word is either kept as it is (word) or replaced with <b>word, word</b>, or <b>word</b>.

private static string MyModify(string text, string wordsToExclude) {
  HashSet<string> exclude = new HashSet<string>(
    wordsToExclude.Split(' '), StringComparer.OrdinalIgnoreCase);

  bool inSelection = false;

  string result = Regex.Replace(text, @"[\w']+", match => {
      var next = match.NextMatch();

      if (inSelection) {
        if (next.Success && exclude.Contains(next.Value)) {
          inSelection = false;

          return match.Value + "</b>";
        }
        else
          return match.Value;
      }
      else {
        if (exclude.Contains(match.Value))
          return match.Value;
        else if (next.Success && exclude.Contains(next.Value))
          return "<b>" + match.Value + "</b>";
        else {
          inSelection = true;
          return "<b>" + match.Value;
        }
      }
    });

  if (inSelection)
    result += "</b>";

  return result;
}

Demo:

string wordsToExclude = "big widgets blue if";

string[] tests = new string[] {
  "widgets for big blue",
  "big widgets are great but better if blue",
  "blue",
  "great but expensive",
  "big and small, blue and green",
};

string report = string.Join(Environment.NewLine, tests
  .Select(test => $"{test,-40} -> {MyModify(test, wordsToExclude)}"));

Console.Write(report);

Outcome:

widgets for big blue                     -> widgets <b>for</b> big blue
big widgets are great but better if blue -> big widgets <b>are great but better</b> if blue
blue                                     -> blue
great but expensive                      -> <b>great but expensive</b>
big and small, blue and green            -> big <b>and small</b>, blue <b>and green</b>
Dmitry Bychenko answered Sep 28 '22 17:09
