Please have a look at the following.
String[] sentenceHolder = titleAndBodyContainer.split("\n|\\.(?!\\d)|(?<!\\d)\\.");
This is how I tried to split a paragraph into sentences, but there is a problem. My paragraph includes dates like Jan. 13, 2014, words like U.S and numbers like 2.2. They all got split by the above code. So basically, this code splits on many dots, whether or not they are full stops.
I also tried String[] sentenceHolder = titleAndBodyContainer.split(".\n");
and String[] sentenceHolder = titleAndBodyContainer.split("\\.");
Both failed.
How can I split a paragraph into sentences "properly"?
In Python, use NLTK's sent_tokenize() to split text into sentences. Call nltk.tokenize.sent_tokenize(text) with a string as text to split the string into a list of sentences.
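Since the question is about Java, a rough counterpart in the standard library is java.text.BreakIterator. The following is only a minimal sketch: the class name BreakIteratorSentences and the helper split are made up for illustration, and how abbreviations such as "Jan." or "U.S" are handled depends on the JDK's built-in rules, so test it against your own text before relying on it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
class BreakIteratorSentences {

    // Returns the trimmed text between consecutive sentence boundaries
    // reported by the JDK's sentence BreakIterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello world. How are you? I am fine.")) {
            System.out.println(s);
        }
    }
}
```

BreakIterator is locale-aware and needs no extra dependency, which makes it a reasonable first thing to try before writing a custom regex.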
For splitting sentences, first mark the clauses. Then make the sub-clauses independent by omitting subordinating linkers and inserting subjects or other words wherever necessary. Example: "When I went to Delhi I met my friend who lives there." Clause 1: (When) I went to Delhi.
In Word documents etc., each newline indicates a new paragraph, so you'd just use `text.split("\n")` (where `text` is a string variable containing the text of your file). In other formats, paragraphs are separated by a blank line (two consecutive newlines), so you'd use `text.
Splitting a string by a sentence as a delimiter: you can also split a string by passing a whole sentence as the delimiter. Each time the specified sentence occurs, the string is divided there into a separate token.
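In Java, String.split takes a regular expression, so a sentence used as a delimiter should be quoted first; otherwise its dots match any character. Here is a small sketch of that idea, where SplitByDelimiter and splitOn are illustrative names only:

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class SplitByDelimiter {

    // Splits text at every literal occurrence of delimiter.
    // Pattern.quote turns dots and other regex metacharacters into literals.
    static String[] splitOn(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String text = "one. STOP HERE. two. STOP HERE. three.";
        for (String token : splitOn(text, " STOP HERE. ")) {
            System.out.println(token);
        }
    }
}
```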
You can try this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2. They all got split by the above code.";
// Match one sentence at a time: ., ! or ? ends a sentence only when followed
// (optionally after a closing quote) by whitespace or the end of the input.
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(str);
while (reMatcher.find()) {
    System.out.println(reMatcher.group());
}
Output:
This is how I tried to split a paragraph into a sentence.
But, there is a problem.
My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2.
They all got split by the above code.
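Note that the sample string above writes "Jan.13" without a space. With "Jan. 13", as in the question, this pattern ends a sentence after "Jan." because whitespace follows the dot. One possible workaround, sketched here under the assumption that every real sentence starts with an uppercase ASCII letter, is to split only on whitespace that follows ., ! or ? and precedes a capital. The names LookaroundSplit and split are illustrative only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of any library.
class LookaroundSplit {

    // Splits on runs of whitespace (including newlines) that come right after
    // '.', '!' or '?' and right before an uppercase ASCII letter.
    // Assumption: a new sentence always begins with an uppercase letter.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+(?=[A-Z])"));
    }

    public static void main(String[] args) {
        String text = "My paragraph includes dates like Jan. 13, 2014, "
                + "words like U.S and numbers like 2.2. They all got split.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

This keeps "Jan. 13", "U.S" and "2.2" intact in the sample, but it still splits after abbreviations followed by a capitalized word (for example "Mr. Smith") and misses sentences that start with a digit or a lowercase letter, so treat it as a heuristic rather than a complete solution.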