 

Sentence detection using NLP

I am trying to parse out sentences from a huge amount of text. Using Java, I started off with NLP tools like OpenNLP and Stanford's Parser.

But here is where I get stuck: though both these parsers are pretty great, they fail when it comes to non-uniform text.

For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both parsers fail miserably.

I even tried setting the option in the Stanford parser for multiple sentence terminators, but the output was not much better!

Any ideas??

Edit: To make it simpler, I am looking to parse text where the delimiter is either a new line ("\n") or a period (".").
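
For the simplified case, a plain-Java split on those two delimiters illustrates the intent (a minimal sketch; the class name and regex are illustrative only, not taken from OpenNLP or the Stanford tools):

    import java.util.Arrays;
    import java.util.List;

    public class SimpleSplitter {
        // Split wherever a period is followed by whitespace, or at a newline.
        // This only covers the simplified "period or newline" case described above.
        public static List<String> split(String text) {
            return Arrays.asList(text.split("(?<=\\.)\\s+|\\r?\\n+"));
        }

        public static void main(String[] args) {
            String text = "First sentence. Second sentence.\n- a bullet point\n- another bullet";
            for (String s : split(text)) {
                System.out.println("[" + s.trim() + "]");
            }
        }
    }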

Asked Dec 12 '11 by Roopak Venkatakrishnan


People also ask

What is sentence detection in NLP?

The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.

What is sentence boundaries in NLP?

Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end.

What is a sentence boundary error?

Sentence boundary errors are also commonly known as run-on sentences, comma splices, and fused sentences. These errors are all variations of the same problem: two sentences or independent clauses that are joined together incorrectly.


2 Answers

First, you have to clearly define the task. What, precisely, is your definition of 'a sentence'? Until you have such a definition, you will just wander in circles.

Second, cleaning dirty text is usually a rather different task from 'sentence splitting'. The various NLP sentence chunkers assume relatively clean input text. Getting from HTML, extracted PowerPoint, or other noisy sources to plain text is another problem (see the sketch after this answer).

Third, the Stanford tools and other heavyweight systems are statistical, so they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher the error rate.
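
To illustrate the second point, here is a minimal sketch of the HTML-to-text cleanup that would run before any sentence splitter; it assumes the jsoup library is on the classpath (the answer does not name a specific tool):

    import org.jsoup.Jsoup;

    public class HtmlCleanup {
        // Strip tags so a downstream sentence splitter sees plain text instead of markup.
        public static String toPlainText(String html) {
            return Jsoup.parse(html).text();
        }

        public static void main(String[] args) {
            String html = "<p>First sentence. Second one.</p><ul><li>a bullet</li><li>another</li></ul>";
            System.out.println(toPlainText(html));
        }
    }

Note that text() normalizes whitespace, so the bullet items above come back joined on one line; restoring those boundaries (for example, one line per list item) is exactly the kind of cleanup work the answer is pointing at.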

Answered Sep 19 '22 by bmargulies


Write a custom sentence splitter. You could use something like the Stanford splitter as a first pass and then write a rule-based post-processor to correct mistakes.

I did something like this for biomedical text I was parsing. I used the GENIA splitter and then fixed stuff after the fact.

EDIT: If your input is HTML, you should preprocess it first, for example by handling bulleted lists, and then apply your splitter.
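
A minimal sketch of that two-pass idea, using the JDK's BreakIterator as a stand-in for the first-pass splitter (the answer used the Stanford and GENIA splitters); the post-processing rule here, re-splitting on newlines so bullet points become their own sentences, is just one example of a fix-up:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public class TwoPassSplitter {
        // First pass: any off-the-shelf sentence splitter (BreakIterator as a stand-in).
        static List<String> firstPass(String text) {
            List<String> out = new ArrayList<>();
            BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
            bi.setText(text);
            int start = bi.first();
            for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
                out.add(text.substring(start, end));
            }
            return out;
        }

        // Second pass: rule-based fix-ups, e.g. treat each bullet line as its own sentence.
        static List<String> postProcess(List<String> sentences) {
            List<String> out = new ArrayList<>();
            for (String s : sentences) {
                for (String part : s.split("\\r?\\n")) {
                    if (!part.trim().isEmpty()) {
                        out.add(part.trim());
                    }
                }
            }
            return out;
        }

        public static void main(String[] args) {
            String text = "A normal sentence. Another one.\n- bullet one\n- bullet two";
            for (String s : postProcess(firstPass(text))) {
                System.out.println("[" + s + "]");
            }
        }
    }

In practice the post-processor collects whatever rules the first pass keeps getting wrong (bullets, headings, abbreviations), which is the "fixed stuff after the fact" step described above.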

Answered Sep 19 '22 by nflacco