I'm having difficulty getting the CoreNLP system to correctly find where one sentence ends and another begins in a corpus of poetry.
The reasons why it's struggling:
This is a particularly tricky one (the system thought the first sentence ended at the "." at the beginning of the second stanza).
Given the lack of capitalization and punctuation to go on, I tried -tokenizeNLs to see whether that improved things, but it went overboard and cut off any sentence that ran across a blank line (there are a few of these).
These sentences often end at the end of a line, but not always. Ideally the system would treat each line ending as a candidate sentence break and weigh how likely it is to be an actual endpoint, but I don't know how I would implement that.
Is there an elegant way to do this? Or an alternative?
Thanks in advance!
(expected sentence output here)
I built a sentence segmenter that works excellently on unpunctuated or partially punctuated text too. You can find it at https://github.com/bedapudi6788/deepsegment .
This model is based on the idea that Named Entity Recognition can be used to detect sentence boundaries (i.e., the beginning or the end of a sentence). I used data from Tatoeba to generate the training data and trained a BiLSTM+CRF model with GloVe embeddings and character-level embeddings for this task.
Although this is built in Python, you can set up a simple REST API using Flask and call it from your Java code.
This would be a neat project! I don't think anyone is working on it in our group at the moment, but I see no reason why we wouldn't incorporate a patch if you make one. The biggest challenge I see is that our sentence splitter is currently entirely rule-based, and therefore these sorts of "soft" decisions are relatively hard to incorporate.
A possible solution for your case could be to use language model "end of sentence" probabilities (three options, in no particular order: https://kheafield.com/code/kenlm/, https://code.google.com/p/berkeleylm/, http://www.speech.sri.com/projects/srilm/). Line endings with a sufficiently high end-of-sentence probability could then be split off as new sentences.
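To make the idea concrete, here is a rough sketch in Python. It is not CoreNLP code and it does not use any of the toolkits linked above: as a stand-in for a real language model it estimates the end-of-sentence probability from simple counts of how often a word closes a training sentence. The function names (`train_end_model`, `end_probability`, `split_at_line_ends`), the smoothing, and the 0.5 threshold are all assumptions made for illustration; with a real language model you would replace `end_probability` with that model's end-of-sentence score.

```python
from collections import Counter


def train_end_model(training_sentences):
    """Count how often each word occurs and how often it is sentence-final."""
    word_counts = Counter()
    end_counts = Counter()
    for sentence in training_sentences:
        words = sentence.lower().split()
        if not words:
            continue
        word_counts.update(words)
        end_counts[words[-1]] += 1
    return word_counts, end_counts


def end_probability(word, model, alpha=1.0):
    """Smoothed estimate of P(sentence ends here | last word).

    Toy stand-in for a language model's end-of-sentence probability.
    Unseen words get 0.5 with the default smoothing.
    """
    word_counts, end_counts = model
    word = word.lower()
    return (end_counts[word] + alpha) / (word_counts[word] + 2 * alpha)


def split_at_line_ends(lines, model, threshold=0.5):
    """Treat every line ending as a candidate break; split only if likely."""
    sentences, current = [], []
    for line in lines:
        words = line.split()
        if not words:
            continue  # blank lines (stanza breaks) do not force a split
        current.extend(words)
        if end_probability(words[-1], model) > threshold:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep whatever is left over as a final sentence
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    model = train_end_model([
        "the sun sets on the sea",
        "i walk beside the sea",
        "the night is long",
        "we wait for the night",
    ])
    poem = ["i walk beside the", "sea", "", "the night is", "long"]
    for sentence in split_at_line_ends(poem, model):
        print(sentence)
```

Unlike -tokenizeNLs, a blank line here never forces a break; only the score at a line ending does. The threshold would need tuning on a few hand-segmented poems.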