
Java library that finds sentence boundaries

Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

Here's my experience with BreakIterator:

Using the example here, I have the following Japanese text:

今日はパソコンを買った。高性能のマックは早い！とても快適です。

Written as ASCII Unicode escapes, it looks like this:

\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002

Here's the part of that sample that I changed:

  static void sentenceExamples() {
    Locale currentLocale = new Locale("ja", "JP");
    BreakIterator sentenceIterator =
        BreakIterator.getSentenceInstance(currentLocale);
    String someText = "今日はパソコンを買った。高性能のマックは早い！とても快適です。";
    // ... rest of the sample is unchanged
  }

When I look at the boundary indices, I see this:

0|13|24|32

But those indices don't correspond to any sentence terminators.

Mike Sickler asked Jan 27 '09 13:01


2 Answers

You want to look into the internationalized BreakIterator class (java.text.BreakIterator). Its sentence instance is a good starting point for finding sentence boundaries.
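As a rough illustration, here is a minimal sketch of walking the boundaries that a sentence BreakIterator reports and cutting the text at them. The class and method names (`SentenceBoundaryDemo`, `boundaries`, `sentences`, `stripBom`) are made up for this example; only `BreakIterator` and `Locale` come from the JDK, and `Locale.JAPAN` is used in place of the question's `new Locale("ja", "JP")`. One thing worth checking in the question's data: the escaped string starts with \ufeff, a byte order mark that occupies index 0. Counting characters with that mark included, the reported indices 13, 24 and 32 fall directly after the first 。, the ！ and the final 。, so the sketch strips the mark before iterating.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBoundaryDemo {

    // Collects every boundary offset the sentence iterator reports,
    // starting at 0 and ending at text.length().
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    // Cuts the text between consecutive boundaries; the pieces
    // concatenate back to the original input.
    static List<String> sentences(String text, Locale locale) {
        List<Integer> bounds = boundaries(text, locale);
        List<String> result = new ArrayList<>();
        for (int i = 1; i < bounds.size(); i++) {
            result.add(text.substring(bounds.get(i - 1), bounds.get(i)));
        }
        return result;
    }

    // Drops a leading byte order mark (U+FEFF) so that offsets
    // line up with the visible characters.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // The string from the question, including its leading byte order mark.
        String raw = "\uFEFF\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002"
                + "\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01"
                + "\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002";
        String text = stripBom(raw);

        System.out.println(boundaries(text, Locale.JAPAN));
        for (String sentence : sentences(text, Locale.JAPAN)) {
            System.out.println(sentence);
        }
    }
}
```

Printing each piece next to its offsets makes it easier to see which characters a boundary actually follows than reading the raw index list alone.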

GaryF answered Nov 12 '22 18:11


You wrote:

I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

A basic problem here is that sentence terminators depend on context. Consider:

How did Dr. Jones compute 5! without recursion?

This should be recognized as a single sentence, but if you just split on possible sentence terminators you will get three sentences.
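To make the failure concrete, here is a small sketch of the naive approach: split after every '.', '!' or '?' that is followed by whitespace. The class name `NaiveSentenceSplitter` and its `split` method are invented for this illustration and are not part of any library.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSentenceSplitter {

    // Splits after every '.', '!' or '?' that is followed by whitespace,
    // with no knowledge of abbreviations or mathematical notation.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        String text = "How did Dr. Jones compute 5! without recursion?";
        for (String piece : split(text)) {
            System.out.println(piece);
        }
        // Prints three lines:
        // How did Dr.
        // Jones compute 5!
        // without recursion?
    }
}
```

The abbreviation "Dr." and the factorial "5!" are both treated as sentence ends, which is exactly the over-splitting described above.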

So this is a more complex problem than it first appears. It can be approached using machine learning techniques. You could, for instance, look into the OpenNLP project, in particular the SentenceDetectorME class.

Fabian Steeg answered Nov 12 '22 18:11