Is there a way to extract paragraph information from Stanford CoreNLP? I'm currently using it to extract sentences from a document, but I'm also interested in identifying the paragraph structure of the document, which I'd ideally like CoreNLP to do for me. Paragraph breaks appear as double line breaks in my source document. I've looked through CoreNLP's javadoc, and it seems there is a ParagraphAnnotation class, but the documentation doesn't specify what it contains, and I can't find an example anywhere of how to use it. Can anyone point me in the right direction?
For reference, my current code does something like this:
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
List<Sentence> convertedSentences = new ArrayList<> ();
for (CoreMap sentence : sentences)
{
    convertedSentences.add (new Sentence (sentence));
}
where Sentence's constructor extracts the words from the sentence. How would I extend this to get an extra level of data? That is, instead of a single document-wide 'convertedSentences' list, I'd like a 'convertedParagraphs' list, each entry of which contains its own 'convertedSentences' list.
I tried the approach that seemed most obvious to me:
List<CoreMap> paragraphs = document.get(ParagraphsAnnotation.class);
for (CoreMap paragraph : paragraphs)
{
    List<CoreMap> sentences = paragraph.get(SentencesAnnotation.class);
    List<Sentence> convertedSentences = new ArrayList<> ();
    for (CoreMap sentence : sentences)
    {
        convertedSentences.add (new Sentence (sentence));
    }
    convertedParagraphs.add (new Paragraph (convertedSentences));
}
but this didn't work, so I guess I misunderstand something about how this is supposed to work.
It appears that the existence of a ParagraphsAnnotation class in CoreNLP is a red herring - nothing actually uses this class (see http://grepcode.com/search/usages?type=type&id=repo1.maven.org%[email protected]%[email protected]@edu%24stanford%24nlp%[email protected]&k=u - quite literally, there are no references to this class other than its definition). Therefore, I have to break the paragraphs myself.
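The key to this is to notice that each sentence contained within the SentencesAnnotation carries a CharacterOffsetBeginAnnotation, which gives the character offset in the source text at which the sentence starts. Checking whether the two characters just before that offset are a double line break tells me whether the sentence opens a new paragraph. With text being the original document string, my code then becomes something like this: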
// 'text' is the original document string that was annotated.
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
List<Paragraph> paragraphs = new ArrayList<> ();
List<Sentence> convertedSentences = new ArrayList<> ();
for (CoreMap sentence : sentences)
{
    int sentenceOffsetStart = sentence.get (CharacterOffsetBeginAnnotation.class);
    // A double line break right before this sentence closes the previous paragraph.
    if (sentenceOffsetStart > 1 && text.substring (sentenceOffsetStart - 2, sentenceOffsetStart).equals("\n\n") && !convertedSentences.isEmpty ())
    {
        Paragraph current = new Paragraph (convertedSentences);
        paragraphs.add (current);
        convertedSentences = new ArrayList<> ();
    }
    convertedSentences.add (new Sentence (sentence));
}
// Flush the sentences of the final paragraph (skipped for an empty document).
if (!convertedSentences.isEmpty ())
{
    Paragraph current = new Paragraph (convertedSentences);
    paragraphs.add (current);
}
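One limitation of the exact "\n\n" comparison is that it misses Windows line endings ("\r\n\r\n") and blank lines that contain spaces. The following is a minimal sketch of a more tolerant check, not a definitive implementation: it looks at the whole gap between the end of one sentence and the start of the next and counts line breaks. It assumes each sentence also exposes its end offset (CoreNLP has a CharacterOffsetEndAnnotation for this, but verify that your pipeline sets it on sentences). ParagraphGrouper, group and hasBlankLine are hypothetical names, and plain int arrays stand in for the annotation values so the example runs without CoreNLP on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
final class ParagraphGrouper {

    /**
     * Groups sentence indices into paragraphs. begins[i] and ends[i] stand in for
     * the begin and end character offsets of sentence i in text. A new paragraph
     * starts whenever the gap before a sentence contains two or more line breaks.
     */
    static List<List<Integer>> group(String text, int[] begins, int[] ends) {
        List<List<Integer>> paragraphs = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int i = 0; i < begins.length; i++) {
            if (i > 0 && hasBlankLine(text.substring(ends[i - 1], begins[i]))) {
                paragraphs.add(current);
                current = new ArrayList<>();
            }
            current.add(i);
        }
        // Flush the last paragraph, if any sentences were seen.
        if (!current.isEmpty()) {
            paragraphs.add(current);
        }
        return paragraphs;
    }

    /** True if the gap holds at least two '\n' characters ("\r\n" counts once). */
    static boolean hasBlankLine(String gap) {
        int newlines = 0;
        for (int i = 0; i < gap.length(); i++) {
            if (gap.charAt(i) == '\n') {
                newlines++;
            }
        }
        return newlines >= 2;
    }

    public static void main(String[] args) {
        String text = "One. Two.\n\nThree.\r\n\r\nFour. Five.";
        int[] begins = {0, 5, 11, 21, 27};
        int[] ends = {4, 9, 17, 26, 32};
        System.out.println(group(text, begins, ends)); // prints [[0, 1], [2], [3, 4]]
    }
}
```

In the real pipeline the two arrays would be filled from each sentence CoreMap, and the index lists would be replaced by the Sentence and Paragraph objects used above.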