Split text into sentences in C# [closed]

Tags: c#, text, split

I want to divide a text into sentences. A sentence ends with a period (.), ? or !, followed by one or more whitespace characters, and the next sentence starts with an uppercase letter.

For example:

First sentence. Second sentence!

How can I do that?

Lato asked Feb 10 '11 12:02


4 Answers

You can split on a regular expression that matches white space, with a lookbehind that looks for the sentence terminators:

string[] sentences = Regex.Split(input, @"(?<=[\.!\?])\s+");

This will split on the white space characters and keep the terminators in the sentences.

Example:

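// Requires: using System; using System.Text.RegularExpressions;
string input = "First sentence. Second sentence! Third sentence? Yes.";
// The lookbehind keeps the terminator; only the whitespace after it is consumed.
string[] sentences = Regex.Split(input, @"(?<=[\.!\?])\s+");

foreach (string sentence in sentences) {
  Console.WriteLine(sentence);
}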

Output:

First sentence.
Second sentence!
Third sentence?
Yes.
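
The question also asks that the next sentence start with an uppercase letter, which the pattern above does not check. A possible variant (a sketch, not part of the original answer) adds a lookahead for an uppercase letter, so text such as "approx. five" is not split. The class name SentenceSplitter and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Split on whitespace that follows '.', '!' or '?' (lookbehind)
    // and is followed by an uppercase letter (lookahead, \p{Lu}).
    private static readonly Regex Boundary =
        new Regex(@"(?<=[.!?])\s+(?=\p{Lu})");

    public static string[] Split(string input)
    {
        return Boundary.Split(input);
    }

    public static void Main()
    {
        string input = "It costs approx. five dollars. Second sentence! Third one? yes.";
        foreach (string sentence in Split(input))
        {
            Console.WriteLine(sentence);
        }
        // Prints:
        // It costs approx. five dollars.
        // Second sentence!
        // Third one? yes.
    }
}
```

Note that this still splits after an abbreviation followed by a capitalized word, for example "Dr. Smith".
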
Guffa answered Oct 17 '22 22:10


What languages do you want to support? For example, Thai has no spaces between words, and sentences are separated by a space. So, in general, this task is very complex. Also consider the useful comment by Fredrik Mörk.

So, first you need to define a set of rules for what a "sentence" is. Then you can use one of the suggested solutions.
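
To make "a set of rules" concrete, here is a minimal sketch of one possible extra rule for English text: do not end a sentence after a known abbreviation. The class name RuleBasedSplitter and the tiny abbreviation list are assumptions made up for illustration, and real text would need a much longer list or a different approach.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RuleBasedSplitter
{
    // Hypothetical, deliberately tiny abbreviation list (an assumption).
    private static readonly HashSet<string> Abbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "Dr.", "Mr.", "Mrs.", "e.g.", "i.e." };

    public static List<string> Split(string input)
    {
        var sentences = new List<string>();
        string pending = "";

        // Rule 1: a terminator followed by whitespace is a candidate boundary.
        foreach (string piece in Regex.Split(input, @"(?<=[.!?])\s+"))
        {
            if (piece.Length == 0) continue;

            // Rejoin with a single space; the original whitespace is not preserved.
            pending = pending.Length == 0 ? piece : pending + " " + piece;

            // Rule 2: if the last word is a known abbreviation, keep accumulating.
            string lastWord = pending.Substring(pending.LastIndexOf(' ') + 1);
            if (!Abbreviations.Contains(lastWord))
            {
                sentences.Add(pending);
                pending = "";
            }
        }

        if (pending.Length > 0) sentences.Add(pending);
        return sentences;
    }

    public static void Main()
    {
        foreach (string s in Split("Dr. Smith arrived. He sat down."))
        {
            Console.WriteLine(s);
        }
        // Prints:
        // Dr. Smith arrived.
        // He sat down.
    }
}
```
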

wonder.mice answered Oct 17 '22 23:10


Have you tried String.Split()? See its documentation.

m.edmondson answered Oct 18 '22 00:10


Try this (MSDN)

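// Note: the terminators are removed from the results, the whitespace after
// them is kept, and a trailing empty entry is returned for the final '!'.
char[] separators = new char[] {'!', '.', '?'};
string[] sentences1 = "First sentence. Second sentence!".Split(separators);
// or, using the params overload:
string[] sentences2 = "First sentence. Second sentence!".Split('!', '.', '?');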
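
With the input above, String.Split returns "First sentence", " Second sentence" (with a leading space) and an empty string. If losing the terminators is acceptable, a small cleanup step might look like the sketch below. The class name SplitCleanup is made up for illustration.

```csharp
using System;
using System.Linq;

public static class SplitCleanup
{
    public static string[] SplitAndTrim(string input)
    {
        char[] separators = { '!', '.', '?' };
        return input
            // Drop the empty entry produced by a terminator at the end.
            .Split(separators, StringSplitOptions.RemoveEmptyEntries)
            // Remove the whitespace left over after each terminator.
            .Select(part => part.Trim())
            // Discard parts that contained only whitespace.
            .Where(part => part.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        foreach (string s in SplitAndTrim("First sentence. Second sentence!"))
        {
            Console.WriteLine("[" + s + "]");
        }
        // Prints:
        // [First sentence]
        // [Second sentence]
    }
}
```
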
Simon answered Oct 17 '22 22:10