I am writing a regular expression to find the ends of sentences in a text. For this I assume that any sentence can end with '.', '!' or '?'. Sometimes, though, people like to write !!!!!! at the end of their sentence, so I want to collapse any repeated dots, exclamation marks or question marks into a single one. But I want to allow the use of '...'. How can I include this exception? Please advise. Thanks!
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

Pattern p = null;
try {
    // ([!?.] with optional spaces), followed by ([!?.] with optional spaces) repeated 1 or more times
    p = Pattern.compile("([!?.]\\s*)([!?.]\\s*)+");
} catch (PatternSyntaxException pex) {
    pex.printStackTrace();
    System.exit(1); // non-zero exit code signals the failure
}
// get the matcher
Matcher m = p.matcher(this.sentence);
int index = 0;
while (m.find(index)) {
    System.out.println(this.sentence);
    System.out.println(p.toString());
    // keep only the first punctuation mark of the matched run
    String replacement = String.valueOf(this.sentence.charAt(m.start()));
    // splice by position; replaceAll would need regex escaping and would hit every occurrence
    this.sentence = this.sentence.substring(0, m.start()) + replacement + this.sentence.substring(m.end());
    // continue searching just after the mark that was kept
    index = m.start() + 1;
    System.out.println(this.sentence);
    // the string changed, so the old matcher's offsets are stale
    m = p.matcher(this.sentence);
}
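One possible way to carve out the '...' exception, offered as a minimal sketch rather than a definitive answer: match every run of two or more marks and decide per match whether to keep it. The class and method names (PunctuationCollapser, collapse) are hypothetical, and treating only an exact three-dot run as an ellipsis is an assumption. A run such as '....' or '. . .' is still collapsed to a single dot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
final class PunctuationCollapser {
    // A run of two or more sentence-ending marks, optionally separated by whitespace.
    // Unlike the pattern in the question, trailing whitespace is not consumed.
    private static final Pattern RUN = Pattern.compile("[.!?](?:\\s*[.!?])+");

    static String collapse(String text) {
        Matcher m = RUN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String run = m.group();
            // Assumption: only an exact "..." counts as an allowed ellipsis.
            // Any other run is reduced to its first mark.
            String replacement = run.equals("...") ? run : run.substring(0, 1);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, PunctuationCollapser.collapse("Wow!!!!!! Really?? Well... okay.") returns "Wow! Really? Well... okay.". Building the result with appendReplacement avoids mutating the string while a matcher is still iterating over it.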
Disclaimer: my answer will be off topic (not using regular expressions).
If it's not too heavyweight, try using Apache OpenNLP. NLP stands for "natural language processing". Check its documentation on sentence detection.
The relevant bit of code is:
String sentences[] = sentenceDetector.sentDetect(" First sentence. Second sentence. ");
You'll get an array of two Strings. The first one will be "First sentence.", the second one will be "Second sentence.".
There's more code to be written before using the aforementioned line of code, but you get the general idea.
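If pulling in OpenNLP does turn out to be too heavyweight, the JDK's own java.text.BreakIterator offers a rougher, rule-based sentence splitter. The sketch below is not OpenNLP's API. It is a standard-library alternative, and the class and method names (JdkSentenceSplitter, split) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper built on the JDK only (no OpenNLP involved).
final class JdkSentenceSplitter {
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk the sentence boundaries and cut the text between each pair.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

Being purely rule-based, this approach is likely to handle abbreviations and unusual punctuation less well than a trained model, so it is best treated as a fallback rather than a replacement.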