I am looking for a good .NET regular expression that I can use for parsing out individual sentences from a body of text.
It should be able to parse the following block of text into exactly six sentences:
Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.
This is proving a little more challenging than I originally thought.
Any help would be greatly appreciated. I am going to use this to train the system on known bodies of text.
The Parse Regex operator (also called the extract operator) enables users comfortable with regular expression syntax to extract more complex data from log lines. Parse regex can be used, for example, to extract nested fields.
My sentence must start with one or more whitespace characters (tabs and spaces can be bunched together before the first non-whitespace character appears). Each word after the first must be separated by a single whitespace character. And yes, the sentence must end with a punctuation mark.
Basically, (0+1)* matches any sequence of ones and zeroes. So, in your example, (0+1)*1(0+1)* should match any sequence that contains at least one 1. It would not match 000, but it would match 010, 1, 111, etc. Here (0+1) means 0 OR 1.
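In .NET regex syntax `+` is a quantifier rather than alternation, so the formal-notation expression above has to be translated before it can be tried out. The following is only a sketch, assuming whole-string matching is wanted: the `\A` and `\z` anchors and the names `BinaryContainsOne` and `IsMatch` are additions for illustration, not part of the original expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BinaryContainsOne
{
    // Formal notation (0+1)*1(0+1)*; the character class [01] plays the role of (0+1).
    // \A and \z anchor the match to the whole input (an assumption of this sketch).
    private static readonly Regex Pattern = new Regex(@"\A[01]*1[01]*\z");

    public static bool IsMatch(string input)
    {
        return Pattern.IsMatch(input);
    }
}

public static class Program
{
    public static void Main()
    {
        // The four inputs mentioned in the text above.
        foreach (string s in new[] { "000", "010", "1", "111" })
        {
            Console.WriteLine(s + ": " + BinaryContainsOne.IsMatch(s));
        }
    }
}
```

Writing `(0|1)` instead of `[01]` would be equivalent here; the character class is just the more common spelling for single-character alternatives.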
No, it is not possible: regular expression language allows parenthesized expressions representing capturing and non-capturing groups, lookarounds, etc., where parentheses must be balanced.
Try this: @"(\S.+?[.!?])(?=\s+|$)"
using System;
using System.Text.RegularExpressions;

string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
foreach (Match match in rx.Matches(str))
{
    int i = match.Index; // start offset of the sentence (not used below)
    Console.WriteLine(match.Value);
}
Results:
Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
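One caveat: by default `.` does not match a newline, so a sentence that wraps across two lines would lose everything before the line break. A minimal sketch of one way around that, assuming wrapped sentences should be kept whole, is to pass `RegexOptions.Singleline`. The class and method names (`SentenceMatcher`, `GetSentences`) are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceMatcher
{
    // Same pattern as in the answer; Singleline lets '.' match '\n' as well.
    private static readonly Regex SentenceRegex =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)", RegexOptions.Singleline);

    public static List<string> GetSentences(string text)
    {
        var sentences = new List<string>();
        foreach (Match match in SentenceRegex.Matches(text))
        {
            sentences.Add(match.Value);
        }
        return sentences;
    }
}

public static class Program
{
    public static void Main()
    {
        // The second sentence wraps onto a new line.
        string text = "Hello world! How are\nyou? I am fine.";
        foreach (string sentence in SentenceMatcher.GetSentences(text))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

The matched sentence still contains the embedded newline; replacing it with a space afterwards is left to the caller.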
For complicated ones, of course, you will need a real parser like SharpNLP or NLTK. Mine is just a quick and dirty one.
Here is the SharpNLP info, and features:
SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:
var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
Regex.Split(str, @"(?<=[.?!])\s+").Dump();
I tested this in LINQPad.
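`Dump()` is a LINQPad extension method, so the snippet will not compile as-is in an ordinary project. A rough console equivalent, with the hypothetical wrapper name `SentenceSplitter` added only for illustration, might look like this:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Split on whitespace that directly follows '.', '?' or '!'.
    // The lookbehind keeps the punctuation attached to its sentence.
    public static string[] Split(string text)
    {
        return Regex.Split(text, @"(?<=[.?!])\s+");
    }
}

public static class Program
{
    public static void Main()
    {
        string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
        foreach (string sentence in SentenceSplitter.Split(str))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

Like the match-based pattern, this splits wherever sentence punctuation is followed by whitespace, so an abbreviation in the middle of a sentence (for example "Dr. Smith") would still be cut in two.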