I am trying to parse out sentences from a huge amount of text using Java. I started off with NLP tools like OpenNLP and Stanford's parser.
But here is where I get stuck: though both these parsers are pretty great, they fail when it comes to non-uniform text.
For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both parsers fail miserably.
I even tried setting the Stanford parser's option for multiple sentence terminators, but the output was not much better.
Any ideas?
Edit: To make it simpler, I am looking to parse text where the delimiter is either a newline ("\n") or a period (".").
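For the simplified problem as stated (split on newline or period), a plain-Java sketch is enough and no NLP toolkit is needed. This is a hypothetical minimal splitter, not how OpenNLP or Stanford segment text, and it will wrongly split abbreviations like "Dr." since it treats every period as a terminator:

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleSplitter {
    // Splits on '.' or '\n', dropping empty fragments.
    // Minimal sketch for the simplified problem; every period is
    // treated as a sentence end, so abbreviations will be split.
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String chunk : text.split("[.\n]")) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }
}
```

For bullet-point text this naive rule often works better than a statistical splitter, precisely because the input does not look like the newswire text those models were trained on.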
Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, or sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Many NLP tasks take a sentence as their input unit, such as part-of-speech tagging, dependency parsing, named entity recognition, and machine translation. In writing, the related boundary errors are run-on sentences, comma splices, and fused sentences: variations of the same problem of two independent clauses joined together incorrectly.
First, you have to clearly define the task. What, precisely, is your definition of 'a sentence'? Until you have such a definition, you will just wander in circles.
Second, cleaning dirty text is usually a rather different task from sentence splitting. The various NLP sentence chunkers assume relatively clean input text; getting from HTML, extracted PowerPoint, or other noisy sources to plain text is a separate problem.
Third, Stanford and other large-caliber tools are statistical, so they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher that error rate will be.
Write a custom sentence splitter. You could use something like the Stanford splitter as a first pass and then write a rule-based post-processor to correct its mistakes.
I did something like this for biomedical text I was parsing. I used the GENIA splitter and then fixed stuff after the fact.
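The two-pass idea above can be sketched with only the JDK: `java.text.BreakIterator` stands in for the first-pass splitter (in practice you would use OpenNLP, Stanford, or GENIA), and the post-processor applies one hypothetical correction rule, namely re-joining a split whose right-hand side starts with a lowercase letter, since that usually means the period belonged to an abbreviation:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PostProcessedSplitter {
    // First pass: the JDK's built-in sentence BreakIterator,
    // standing in for a statistical splitter.
    static List<String> firstPass(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
             start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    // Second pass: a hypothetical rule-based fix. If a fragment
    // starts with a lowercase letter, assume the split was wrong
    // (e.g. after an abbreviation) and glue it back on.
    static List<String> postProcess(List<String> sentences) {
        List<String> fixed = new ArrayList<>();
        for (String s : sentences) {
            if (!fixed.isEmpty() && !s.isEmpty()
                    && Character.isLowerCase(s.charAt(0))) {
                int last = fixed.size() - 1;
                fixed.set(last, fixed.get(last) + " " + s);
            } else {
                fixed.add(s);
            }
        }
        return fixed;
    }
}
```

The real value of this pattern is that each domain-specific quirk (bullet points, abbreviations, citation markers) becomes one small, testable rule instead of a retraining job.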
EDIT: If your input is HTML, you should preprocess it first, for example by handling bulleted lists, and then apply your splitter.
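As a crude sketch of that preprocessing step (a real project should use a proper HTML parser such as jsoup rather than regexes), one might turn each list item into a period-terminated line so the downstream splitter sees sentence-like units, then strip the remaining tags:

```java
public class HtmlPreprocessor {
    // Crude regex-based HTML-to-text conversion; a sketch only.
    static String toPlainText(String html) {
        return html
                .replaceAll("(?i)</li>", ".\n")    // end each bullet like a sentence
                .replaceAll("(?i)<br\\s*/?>", "\n") // line breaks become newlines
                .replaceAll("<[^>]+>", " ")         // drop all other tags
                .replaceAll("[ \t]+", " ")          // collapse runs of spaces
                .trim();
    }
}
```

After this pass, a splitter that treats "." or "\n" as a terminator, like the one the question asks for, handles bullet points for free.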