
How to split paragraphs into sentences?

Please have a look at the following.

String[] sentenceHolder = titleAndBodyContainer.split("\n|\\.(?!\\d)|(?<!\\d)\\.");

This is how I tried to split a paragraph into sentences, but there is a problem. My paragraph includes dates like Jan. 13, 2014, words like U.S and numbers like 2.2, and they all get split by the above code. Basically, this code splits on a lot of dots whether they are full stops or not.

I also tried String[] sentenceHolder = titleAndBodyContainer.split(".\n"); and String[] sentenceHolder = titleAndBodyContainer.split("\\."); and both failed.

How can I split a paragraph into sentences "properly"?

PeakGen asked Jan 29 '14 11:01


1 Answer

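You can try this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2. They all got split by the above code.";

// Match one sentence at a time instead of splitting on every dot: a '.', '!' or '?'
// ends a sentence only when it is followed by whitespace (optionally after a
// closing quote) or by the end of a line.
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(str);
while (reMatcher.find()) {
    // Each match is one sentence, including its terminating punctuation.
    System.out.println(reMatcher.group());
}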

Output:

This is how I tried to split a paragraph into a sentence.
But, there is a problem.
My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2.
They all got split by the above code.
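
If you want to reuse this, here is a minimal sketch that wraps the same expression in a method returning a list. The class name SentenceSplitter and the method name split are my own, not part of any library. I left out Pattern.COMMENTS in this version because the expression contains no literal whitespace or '#', so that flag should make no difference here. Note one limitation: the expression only keeps a dot inside a sentence when the dot is not followed by whitespace, so the original "Jan. 13" (with a space) is still treated as a sentence end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class SentenceSplitter {

    // Same expression as above. A '.', '!' or '?' that is directly followed by a
    // non-whitespace character (as in "2.2" or "U.S") is kept inside the sentence.
    private static final Pattern SENTENCE = Pattern.compile(
            "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
            Pattern.MULTILINE);

    // Collects every match of the sentence pattern, in order of appearance.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            sentences.add(matcher.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // "2.2" stays together, but "Jan. 13" is cut after "Jan." because
        // the dot is followed by a space.
        String text = "Prices rose 2.2 percent. It was reported on Jan. 13, 2014.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Running main prints three lines, "Prices rose 2.2 percent.", "It was reported on Jan." and "13, 2014.", which shows both what the expression handles and where it still breaks. If your text has abbreviations followed by a space, you would probably need an explicit abbreviation list on top of this.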
Ruchira Gayan Ranaweera answered Oct 03 '22 02:10