
What is a regular expression for parsing out individual sentences?

I am looking for a good .NET regular expression that I can use for parsing out individual sentences from a body of text.

It should be able to parse the following block of text into exactly six sentences:

Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.  Newlines should also be accepted. Numbers should not cause   sentence breaks, like 1.23. 

This is proving a little more challenging than I originally thought.

Any help would be greatly appreciated. I am going to use this to train the system on known bodies of text.

Luke Machowski asked Dec 20 '09 17:12



2 Answers

Try this @"(\S.+?[.!?])(?=\s+|$)":

    string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

    Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
    foreach (Match match in rx.Matches(str))
    {
        int i = match.Index;
        Console.WriteLine(match.Value);
    }

Results:

    Hello world!
    How are you?
    I am fine.
    This is a difficult sentence because I use I.D.
    Newlines should also be accepted.
    Numbers should not cause sentence breaks, like 1.23.
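One caveat about newlines: with default options, '.' does not match '\n', so a sentence that wraps onto a second line loses everything before the line break. The following is a minimal sketch, not part of the original answer, that assumes the input may contain such a break and adds RegexOptions.Singleline to the same pattern. The class name SentenceMatcher and the sample string with embedded '\n' characters are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceMatcher
{
    // Same pattern as in the answer. Singleline makes '.' match '\n' as well,
    // so a sentence may continue across a line break.
    private static readonly Regex Pattern =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)", RegexOptions.Singleline);

    // Returns every matched sentence, in order of appearance.
    public static string[] Split(string text)
    {
        return Pattern.Matches(text)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToArray();
    }

    public static void Main()
    {
        // Hypothetical variant of the question's sample with two line breaks,
        // one of them in the middle of the last sentence.
        string text = "Hello world! How are you? I am fine. " +
                      "This is a difficult sentence because I use I.D.\n" +
                      "Newlines should also be accepted. Numbers should not cause\n" +
                      "sentence breaks, like 1.23.";

        foreach (string sentence in Split(text))
        {
            // Show each sentence on one line for readability.
            Console.WriteLine("[" + sentence.Replace("\n", " ") + "]");
        }
    }
}
```

Without RegexOptions.Singleline the last match would be only "sentence breaks, like 1.23.", because no match can start on the first line of that sentence and reach a terminator.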

For complicated ones, of course, you will need a real parser like SharpNLP or NLTK. Mine is just a quick and dirty one.
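To illustrate what "complicated" means here, the sketch below (an illustration added for this write-up, with a made-up class name LimitationDemo and a made-up input string) runs the same pattern on text containing a title abbreviation. The pattern treats any '.', '!' or '?' followed by whitespace as a sentence end, so "Dr." is reported as a sentence of its own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LimitationDemo
{
    // Counts how many sentences the quick-and-dirty pattern finds.
    public static int Count(string text)
    {
        return Regex.Matches(text, @"(\S.+?[.!?])(?=\s+|$)").Count;
    }

    public static void Main()
    {
        // A human reader sees two sentences here.
        string text = "Dr. Smith is here. He left.";

        // The pattern finds three: "Dr.", "Smith is here." and "He left."
        Console.WriteLine(Count(text));
    }
}
```

Handling such cases reliably needs a list of known abbreviations or a trained sentence splitter, which is where the NLP libraries come in.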

Here is the SharpNLP info, and features:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

  • a sentence splitter
  • a tokenizer
  • a part-of-speech tagger
  • a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
  • a parser
  • a name finder
  • a coreference tool
  • an interface to the WordNet lexical database

YOU answered Sep 20 '22 15:09


    var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

    Regex.Split(str, @"(?<=[.?!])\s+").Dump();

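I tested this in LINQPad. Dump() is LINQPad's output helper; in a regular program, iterate over the array returned by Regex.Split instead.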
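For use outside LINQPad, here is a minimal console sketch of the same split. The class name SentenceSplitter is made up for illustration, and the filter for empty strings is an addition: Regex.Split returns an empty final element when the text ends with whitespace after the last terminator, as the block of text in the question does.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    public static string[] Split(string text)
    {
        // Split on whitespace that directly follows '.', '?' or '!'.
        // The lookbehind keeps the punctuation attached to its sentence.
        return Regex.Split(text, @"(?<=[.?!])\s+")
                    .Where(s => s.Length > 0) // drop the empty tail left by trailing whitespace
                    .ToArray();
    }

    public static void Main()
    {
        // Note the trailing space after the final period.
        string str = "Hello world! How are you? I am fine. " +
                     "This is a difficult sentence because I use I.D. " +
                     "Newlines should also be accepted. " +
                     "Numbers should not cause sentence breaks, like 1.23. ";

        foreach (string sentence in Split(str))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

Because \s+ also matches line breaks, sentences separated by newlines are split the same way as sentences separated by spaces.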

SLaks answered Sep 24 '22 15:09