I am trying to parse out sentences from a huge amount of text using Java. I started off with NLP tools like OpenNLP and Stanford's parser.
But here is where I get stuck: though both these parsers are pretty great, they fail when it comes to non-uniform text.
For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both parsers fail miserably.
I even tried setting the Stanford parser's option for multiple sentence terminators, but the output was not much better.
Any ideas?
Edit: To make it simpler, I am looking to parse text where the delimiter is either a newline ("\n") or a period (".").
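For the simplified problem as stated (split on newline or period), a plain-Java sketch is enough and no NLP toolkit is needed. This is a hypothetical minimal splitter, not how OpenNLP or Stanford segment text, and it will wrongly split abbreviations like "Dr." since it treats every period as a terminator:

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleSplitter {
    // Splits on '.' or '\n', dropping empty fragments.
    // Minimal sketch for the simplified problem; every period is
    // treated as a sentence end, so abbreviations will be split.
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String chunk : text.split("[.\n]")) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }
}
```

For bullet-point text this naive rule often works better than a statistical splitter, precisely because the input does not look like the newswire text those models were trained on.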
Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, or sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Many NLP tasks take a sentence as their input unit, such as part-of-speech tagging, dependency parsing, named entity recognition, and machine translation. In writing, the related boundary errors are run-on sentences, comma splices, and fused sentences: variations of the same problem of two independent clauses joined together incorrectly.
First, you have to clearly define the task. What, precisely, is your definition of 'a sentence'? Until you have such a definition, you will just wander in circles.
Second, cleaning dirty text is usually a rather different task from sentence splitting. The various NLP sentence chunkers assume relatively clean input text; getting from HTML, extracted PowerPoint, or other noisy sources to plain text is a separate problem.
Third, Stanford and other large-caliber tools are statistical, so they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher that error rate will be.
Write a custom sentence splitter. You could use something like the Stanford splitter as a first pass and then write a rule-based post-processor to correct its mistakes.
I did something like this for biomedical text I was parsing. I used the GENIA splitter and then fixed stuff after the fact.
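The two-pass idea above can be sketched with only the JDK: `java.text.BreakIterator` stands in for the first-pass splitter (in practice you would use OpenNLP, Stanford, or GENIA), and the post-processor applies one hypothetical correction rule, namely re-joining a split whose right-hand side starts with a lowercase letter, since that usually means the period belonged to an abbreviation:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PostProcessedSplitter {
    // First pass: the JDK's built-in sentence BreakIterator,
    // standing in for a statistical splitter.
    static List<String> firstPass(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
             start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    // Second pass: a hypothetical rule-based fix. If a fragment
    // starts with a lowercase letter, assume the split was wrong
    // (e.g. after an abbreviation) and glue it back on.
    static List<String> postProcess(List<String> sentences) {
        List<String> fixed = new ArrayList<>();
        for (String s : sentences) {
            if (!fixed.isEmpty() && !s.isEmpty()
                    && Character.isLowerCase(s.charAt(0))) {
                int last = fixed.size() - 1;
                fixed.set(last, fixed.get(last) + " " + s);
            } else {
                fixed.add(s);
            }
        }
        return fixed;
    }
}
```

The real value of this pattern is that each domain-specific quirk (bullet points, abbreviations, citation markers) becomes one small, testable rule instead of a retraining job.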
EDIT: If your input is HTML, you should preprocess it first, for example by handling bulleted lists, and then apply your splitter.
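As a crude sketch of that preprocessing step (a real project should use a proper HTML parser such as jsoup rather than regexes), one might turn each list item into a period-terminated line so the downstream splitter sees sentence-like units, then strip the remaining tags:

```java
public class HtmlPreprocessor {
    // Crude regex-based HTML-to-text conversion; a sketch only.
    static String toPlainText(String html) {
        return html
                .replaceAll("(?i)</li>", ".\n")    // end each bullet like a sentence
                .replaceAll("(?i)<br\\s*/?>", "\n") // line breaks become newlines
                .replaceAll("<[^>]+>", " ")         // drop all other tags
                .replaceAll("[ \t]+", " ")          // collapse runs of spaces
                .trim();
    }
}
```

After this pass, a splitter that treats "." or "\n" as a terminator, like the one the question asks for, handles bullet points for free.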