Sentence annotation in text without punctuation

I'm having difficulty getting the CoreNLP system to correctly find where one sentence ends and another begins in a corpus of poetry.

The reasons why it's struggling:

  • some poems have no punctuation throughout their entire length (and sometimes no case)
  • some poems have sentences that run from one paragraph into another
  • some poems have capitalization at the beginning of every line

This is a particularly tricky one (the system thought the first sentence ended at the "." at the beginning of the second stanza).

Given the lack of capitals and punctuation to go on, I tried -tokenizeNLs to see if that improved things, but it went overboard and cut off every sentence that ran across a blank line (and there are a few of those).

These sentences often end at the end of a line, but not always. What would be slick is if the system could treat each line ending as a candidate sentence break and weigh how likely it is to be a real endpoint, but I don't know how I would implement that.

Is there an elegant way to do this? Or an alternative?

Thanks in advance!

(expected sentence output here)

Blair asked Jan 06 '15 21:01


2 Answers

I built a sentence segmenter that works excellently on unpunctuated or partially punctuated text too. You can find it at https://github.com/bedapudi6788/deepsegment .

This model is based on the idea that Named Entity Recognition can be used for sentence boundary detection (i.e., tagging the beginning or the end of a sentence). I used data from Tatoeba to generate the training data and trained a BiLSTM+CRF model with GloVe embeddings and character-level embeddings for this task.

Although this is built in Python, you can set up a simple REST API using Flask and call it from your Java code.
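A minimal sketch of such a wrapper, under a few assumptions: it uses Python's standard-library http.server instead of Flask so that it runs without extra dependencies, and placeholder_segment is a hypothetical stand-in (one sentence per non-empty line) that you would replace with a call to the real segmenter. The JSON field names "text" and "sentences" are also just illustrative choices.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def placeholder_segment(text):
    """Hypothetical stand-in for the real segmenter: one sentence per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def handle_payload(body, segment=placeholder_segment):
    """Turn a JSON request body {"text": ...} into a JSON response body."""
    request = json.loads(body.decode("utf-8"))
    sentences = segment(request["text"])
    return json.dumps({"sentences": sentences}).encode("utf-8")


class SegmentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client announced.
        length = int(self.headers.get("Content-Length", 0))
        response = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Block forever, answering POST requests on localhost (port is an assumption)."""
    HTTPServer(("localhost", port), SegmentHandler).serve_forever()
```

Calling serve() starts the server. The Java side would then POST a JSON body such as {"text": "..."} to it and read the "sentences" array from the response.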

Praneeth Bedapudi answered Sep 25 '22 00:09


This would be a neat project! I don't think anyone is working on it in our group at the moment, but I see no reason why we wouldn't incorporate a patch if you make one. The biggest challenge I see is that our sentence splitter is currently entirely rule-based, and therefore these sorts of "soft" decisions are relatively hard to incorporate.

A possible solution for your case could be to use language model "end of sentence" probabilities (three options, in no particular order: https://kheafield.com/code/kenlm/, https://code.google.com/p/berkeleylm/, http://www.speech.sri.com/projects/srilm/). Line ends with a sufficiently high end-of-sentence probability could then be treated as sentence breaks.
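As a rough illustration of that idea, here is a toy sketch in Python. It does not use any of the toolkits above: it estimates P(end of sentence | last word) from counts over a handful of already-segmented sentences, which is only a crude stand-in for a real language model. The function names, the add-alpha smoothing, and the 0.6 threshold are all assumptions chosen for the example.

```python
from collections import Counter


def train_eos_model(sentences):
    """Count how often each word occurs and how often it is the last word of a sentence."""
    word_counts = Counter()
    final_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if not tokens:
            continue
        word_counts.update(tokens)
        final_counts[tokens[-1]] += 1
    return word_counts, final_counts


def eos_probability(word, word_counts, final_counts, alpha=1.0):
    """Add-alpha smoothed estimate of P(end of sentence | word); unseen words get 0.5."""
    word = word.lower()
    return (final_counts[word] + alpha) / (word_counts[word] + 2 * alpha)


def split_at_line_ends(lines, word_counts, final_counts, threshold=0.6):
    """Treat each line end as a candidate break; split only if the estimate is high enough."""
    sentences, current = [], []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # blank lines (stanza breaks) are not boundaries by themselves
        current.extend(tokens)
        if eos_probability(tokens[-1], word_counts, final_counts) >= threshold:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))  # keep any unfinished trailing text
    return sentences


training = [
    "the sun went down",
    "the birds flew down",
    "we walked to the sea",
    "the sea was calm",
    "night came down",
]
word_counts, final_counts = train_eos_model(training)

poem = ["the sun over the", "sea went down", "", "night came", "down"]
print(split_at_line_ends(poem, word_counts, final_counts))
# prints ['the sun over the sea went down', 'night came down']
```

Only line ends are ever considered as breaks, so a sentence can run across a blank line without being cut. With a real language model, eos_probability would be replaced by the model's score for the end-of-sentence token given the preceding context.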

Gabor Angeli answered Sep 25 '22 00:09