Java library that finds sentence boundaries

Question

Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

Here's my experience with BreakIterator:

Using the example here, I have the following Japanese text:

今日はパソコンを買った。高性能のマックは早い！とても快適です。

Written as Unicode escapes, it looks like this:

\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002

Here's the part of that sample that I changed:

static void sentenceExamples() {

  Locale currentLocale = new Locale("ja", "JP");
  BreakIterator sentenceIterator =
      BreakIterator.getSentenceInstance(currentLocale);
  String someText = "今日はパソコンを買った。高性能のマックは早い！とても快適です。";
  // ... rest of the sample unchanged
}

When I look at the boundary indices, I see this:

0|13|24|32

But those indices don't correspond to any sentence terminators.

GaryF · Accepted Answer

You want to look into the internationalized BreakIterator classes. BreakIterator.getSentenceInstance is a good starting point for sentence boundaries.
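
As a rough sketch of what that could look like, the following uses only the JDK's java.text.BreakIterator to list boundary offsets and cut a text into sentences. The class and method names (SentenceBoundarySketch, boundaries, sentences, stripBom) are made up for illustration, and the Japanese text from the question is written as Unicode escapes so the source file encoding does not matter. One detail worth checking: the escaped string in the question begins with \ufeff, a byte order mark. The three sentences are 12, 11 and 8 characters long, so with that extra leading character they end at offsets 13, 24 and 32, which matches the indices the question reports. Those indices may therefore be the positions just after each terminator, shifted by one.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBoundarySketch {

    // The question's text, without the leading byte order mark (U+FEFF).
    static final String JAPANESE =
        "\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002"
      + "\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01"
      + "\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002";

    // Drops a single leading byte order mark, if present.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    // Collects every boundary offset the sentence iterator reports,
    // including 0 and text.length().
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    // Cuts the text at consecutive boundaries; each piece keeps its terminator.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        String withBom = "\uFEFF" + JAPANESE;
        System.out.println("with BOM:    " + boundaries(withBom, Locale.JAPAN));
        System.out.println("without BOM: " + boundaries(stripBom(withBom), Locale.JAPAN));
        System.out.println("sentences:   " + sentences(JAPANESE, Locale.JAPAN).size());
    }
}
```

Note that each boundary is the offset of the first character after a sentence, not the offset of the terminator itself, so substring(start, end) gives whole sentences directly.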

Fabian Steeg · Answer

You wrote:

I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

A basic problem here is that sentence terminators depend on context. Consider:

How did Dr. Jones compute 5! without recursion?

This should be recognized as a single sentence, but if you just split on possible sentence terminators you will get three sentences, because the period after "Dr" and the exclamation mark in "5!" both look like sentence ends.
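
To make that concrete, here is a minimal sketch of the naive approach: split after every '.', '!' or '?' that is followed by whitespace, with no regard for context. The class name NaiveSentenceSplit and its split method are hypothetical, chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSentenceSplit {

    // Splits after every '.', '!' or '?' followed by whitespace.
    // The lookbehind keeps the terminator attached to the preceding piece.
    static List<String> split(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        String text = "How did Dr. Jones compute 5! without recursion?";
        for (String piece : split(text)) {
            System.out.println("[" + piece + "]");
        }
        // prints:
        // [How did Dr.]
        // [Jones compute 5!]
        // [without recursion?]
    }
}
```

Patching the pattern with a list of abbreviations helps with "Dr." but not with "5!", which is why a context-aware approach tends to be needed.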

So this is a more complex problem than it first appears. It can be approached using machine learning techniques. You could, for instance, look into the OpenNLP project, in particular the SentenceDetectorME class, a maximum-entropy sentence detector.

Tags: java, string, text-segmentation, nlp

Mike Sickler
