I'm trying to split a string with multiple sentences into a string array of individual sentences.
Here's what I have so far:
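String input = "Hello World. "
        + "Today in the U.S.A., it is a nice day! "
        + "Hurrah! "   // trailing space added: the pattern below only splits on whitespace
        + "Here it comes... "
        + "Party time!";
// Split on whitespace that follows ., ? or ! and is followed by any character.
String[] array = input.split("(?<=[.?!])\\s+(?=[\\D\\d])");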
This code works fine for that input. I get:
Hello World.
Today in the U.S.A., it is a nice day!
Hurrah!
Here it comes...
Party time!
I use a lookbehind to check whether a sentence-ending punctuation mark (., ? or !) precedes one or more whitespace characters. If so, the string is split there.
But there are some cases that this regex doesn't cover. For example, "The U.S. is a great country" is incorrectly split into "The U.S." and "is a great country".
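To make the failure concrete, here is a minimal sketch that reproduces it. The class name RegexSplitDemo and the split helper are only for illustration; the pattern is the one shown above.

```java
class RegexSplitDemo {
    // Same pattern as above: whitespace preceded by '.', '?' or '!'
    // and followed by any character at all.
    static final String SENTENCE_BOUNDARY = "(?<=[.?!])\\s+(?=[\\D\\d])";

    // Illustrative helper that applies the pattern to a piece of text.
    static String[] split(String text) {
        return text.split(SENTENCE_BOUNDARY);
    }

    public static void main(String[] args) {
        // The period that ends the abbreviation "U.S." is treated
        // as the end of a sentence, so the text is cut in two.
        for (String part : split("The U.S. is a great country")) {
            System.out.println(part);
        }
    }
}
```

The pattern has no way to tell an abbreviation's period from a sentence-ending one, which is why the split happens after "U.S.".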
Any idea on how I can fix this?
And also, am I missing any edge cases here?
If you don't have to use a regular expression, you can make use of Java's built-in BreakIterator.
The following code shows an example of parsing sentences, but BreakIterator also supports other kinds of boundaries (word, line, character). You can optionally pass in a specific Locale if you are dealing with different languages. This example uses the default locale.
import java.text.BreakIterator;

String input = "Hello World. "
        + "Today in the U.S.A., it is a nice day! "
        + "Hurrah! "
        + "The U.S. is a great country. "
        + "Here it comes... "
        + "Party time!";
// Sentence iterator for the default locale.
BreakIterator iterator = BreakIterator.getSentenceInstance();
iterator.setText(input);
// Walk the boundaries pairwise; each [start, end) range is one sentence.
int start = iterator.first();
for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
    System.out.println(input.substring(start, end));
}
This results in the following output:
Hello World.
Today in the U.S.A., it is a nice day!
Hurrah!
The U.S. is a great country.
Here it comes...
Party time!
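Since the question asks for a string array, the loop above could be wrapped roughly as follows. This is only a sketch: SentenceSplitter and splitSentences are made-up names, and passing Locale.US explicitly is an assumption about the input language. Each sentence is trimmed because the ranges returned by BreakIterator include the trailing whitespace.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    // Hypothetical helper: collects the trimmed sentences of text,
    // using the sentence rules of the given locale.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String input = "The U.S. is a great country. Here it comes... Party time!";
        // Convert to an array, as the question asks for.
        String[] array = splitSentences(input, Locale.US).toArray(new String[0]);
        for (String sentence : array) {
            System.out.println(sentence);
        }
    }
}
```

Returning a List and converting at the call site keeps the helper flexible. Other abbreviations may still be split incorrectly, so it is worth testing against your own text.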