 

Split string into sentences using regular expression

Tags: c#, regex, nlp

I need to split a string like "one. two. three. four. five. six. seven. eight. nine. ten. eleven" into groups of four sentences. I need a regular expression that breaks the string after every fourth period. Something like:

  string regex = @"(.*.\s){4}";

  System.Text.RegularExpressions.Regex exp = new System.Text.RegularExpressions.Regex(regex);

  string result = exp.Replace(toTest, ".\n");

This doesn't work because it replaces the text before the periods, not just the periods themselves. How can I count only the periods and replace every fourth one with a period and a newline character?

Tai Squared asked Nov 05 '22 22:11


2 Answers

Try defining the method

private string AppendNewLineToMatch(Match match) {
    return match.Value + Environment.NewLine;
}

and using

string result = exp.Replace(toTest, AppendNewLineToMatch);

This should call the method for each match, and replace it with that method's result. The method's result would be the matching text and a newline.


EDIT: Also, I agree with Oliver. The correct regex definition should be:

  string regex = @"([^.]*[.]\s*){4}";

Another edit: Fixed the regex, hopefully I got it right this time.
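
Putting the two pieces together, a minimal sketch might look like the following. The class name `SentenceGrouper` and the method `GroupByFour` are made up for illustration; only `Regex.Replace` with a `MatchEvaluator` and the pattern above come from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceGrouper
{
    // Four repetitions of: a run of non-period characters, one period, optional whitespace.
    private static readonly Regex FourSentences = new Regex(@"([^.]*[.]\s*){4}");

    // Called once per match; returns the matched text followed by a newline.
    private static string AppendNewLineToMatch(Match match)
    {
        return match.Value + Environment.NewLine;
    }

    // Hypothetical helper: inserts a newline after every group of four sentences.
    public static string GroupByFour(string text)
    {
        return FourSentences.Replace(text, AppendNewLineToMatch);
    }
}
```

With the sample string from the question, this should put a newline after "four. " and after "eight. ", and leave the trailing "nine. ten. eleven" untouched because it contains fewer than four periods. Since `\s*` is part of the match, the space after every fourth period stays in front of the newline; trim it separately if that matters.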

configurator answered Nov 12 '22 18:11


A . in a regex means "any character" (other than a newline, by default).

So the .*. in your regex matches any run of one or more characters, periods included (it is equivalent to .+), not just a single sentence.

You were probably looking for [^.]*[.] - a series of characters that are not "."s followed by a ".".
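
To see the difference, here is a small sketch that returns the first match of each pattern. The `DotDemo` class and `FirstMatch` helper are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Returns the text of the first match of pattern in text (empty string if none).
    public static string FirstMatch(string pattern, string text)
    {
        return Regex.Match(text, pattern).Value;
    }
}
```

For the input "one. two. three.", `DotDemo.FirstMatch(@".*.", input)` returns the whole string, because the greedy `.*` also consumes the periods, while `DotDemo.FirstMatch(@"[^.]*[.]", input)` stops at the first period and returns "one.".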

Oliver Hallam answered Nov 12 '22 17:11