
Splitting a paragraph into individual sentences. Am I covering all my bases here?

Tags: java, regex

I'm trying to split a string with multiple sentences into a string array of individual sentences.

Here's what I have so far:

String input = "Hello World. "
             + "Today in the U.S.A., it is a nice day! "
             + "Hurrah! "
             + "Here it comes... "
             + "Party time!";
// Split on whitespace that follows '.', '?', or '!'.
String[] array = input.split("(?<=[.?!])\\s+(?=[\\D\\d])");

This code works fine for this input. I get:

Hello World.
Today in the U.S.A., it is a nice day!
Hurrah!
Here it comes...
Party time!

I use a lookbehind to check whether a sentence-ending punctuation mark (., ?, or !) precedes one or more whitespace characters. If so, the string is split there.

But there are some cases that this regex doesn't handle. For example, "The U.S. is a great country." is incorrectly split into "The U.S." and "is a great country."
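A minimal reproduction of the problem, using the same pattern as above. The class and method names (SplitDemo, splitSentences) are made up for this illustration:

```java
import java.util.Arrays;

public class SplitDemo {
    // Same pattern as in the snippet above: whitespace preceded by '.', '?', or '!'.
    static final String SENTENCE_BOUNDARY = "(?<=[.?!])\\s+(?=[\\D\\d])";

    // Hypothetical helper that just wraps String.split with the pattern.
    static String[] splitSentences(String text) {
        return text.split(SENTENCE_BOUNDARY);
    }

    public static void main(String[] args) {
        String[] parts = splitSentences("The U.S. is a great country.");
        // Prints [The U.S., is a great country.]
        // i.e. two elements, because the space after "U.S." follows a period.
        System.out.println(Arrays.toString(parts));
    }
}
```

The pattern cannot tell an abbreviation's period from a sentence-ending one, so any abbreviation followed by a space triggers a split.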

Any idea on how I can fix this?

Also, am I missing any other edge cases here?

Ganz7 asked Aug 24 '15 03:08


1 Answer

If you don't have to use a regular expression, you can make use of Java's built-in BreakIterator.

The following code shows an example of splitting text into sentences, but BreakIterator also supports other kinds of boundaries (word, line, etc.). You can optionally pass in a locale if you are dealing with different languages. This example uses the default locale.

// Requires: import java.text.BreakIterator;
String input = "Hello World. "
    + "Today in the U.S.A., it is a nice day! "
    + "Hurrah!"
    + "The U.S. is a great country. "
    + "Here it comes... "
    + "Party time!";
// Sentence iterator for the default locale.
BreakIterator iterator = BreakIterator.getSentenceInstance();
iterator.setText(input);
// first() returns the start of the text; each next() returns the following boundary index.
int start = iterator.first();
for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
    System.out.println(input.substring(start, end));
}

This results in the following output:

Hello World. 
Today in the U.S.A., it is a nice day! 
Hurrah!
The U.S. is a great country. 
Here it comes... 
Party time!
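
Since the question asks for an array of sentences rather than printed output, the same loop can be wrapped in a small helper. This is only a sketch: the class and method names (SentenceSplitter, splitSentences) are made up for illustration, and it assumes you want the trailing whitespace that BreakIterator keeps on each sentence trimmed off.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {
    // Hypothetical helper: collects the sentences found by BreakIterator for the given locale.
    static List<String> splitSentences(String text, Locale locale) {
        List<String> sentences = new ArrayList<>();
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            // Each segment includes the whitespace after the sentence, so trim it.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> sentences = splitSentences("Hello World. How are you?", Locale.US);
        // Convert to an array if that is what the calling code expects.
        String[] array = sentences.toArray(new String[0]);
        for (String s : array) {
            System.out.println(s);
        }
    }
}
```

Passing the locale explicitly, rather than relying on the default, keeps the results from depending on the machine's settings.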
Marc Baumbach answered Nov 09 '22 02:11