
Paragraph breaks using Stanford CoreNLP

Is there a way to extract paragraph information from Stanford CoreNLP? I currently use it to extract sentences from a document, but I would also like it to identify the document's paragraph structure for me. Paragraph breaks appear as double line breaks in my source document. I've looked through CoreNLP's javadoc and found a ParagraphAnnotation class, but the documentation doesn't specify what it contains, and I can't find an example anywhere of how to use it. Can anyone point me in the right direction?

For reference, my current code does something like this:

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    List<Sentence> convertedSentences = new ArrayList<> ();
    for (CoreMap sentence : sentences)
    {
        convertedSentences.add (new Sentence (sentence));
    }

where Sentence's constructor extracts the words from the sentence. How would I extend this to get an extra level of structure? That is, I want the currently document-wide 'convertedSentences' list to be supplemented by a 'convertedParagraphs' list, each entry of which contains its own 'convertedSentences' list.

I tried the approach that seemed most obvious to me:

    List<CoreMap> paragraphs = document.get(ParagraphsAnnotation.class);
    for (CoreMap paragraph : paragraphs)
    {
        List<CoreMap> sentences = paragraph.get(SentencesAnnotation.class);
        List<Sentence> convertedSentences = new ArrayList<> ();
        for (CoreMap sentence : sentences)
        {
            convertedSentences.add (new Sentence (sentence));
        }

        convertedParagraphs.add (new Paragraph (convertedSentences));
    }

but this didn't work, so I guess I misunderstand something about how this is supposed to work.

asked Oct 20 '25 10:10 by Jules


1 Answer

It appears that the existence of a ParagraphsAnnotation class in CoreNLP is a red herring - nothing actually uses this class (see http://grepcode.com/search/usages?type=type&id=repo1.maven.org%[email protected]%[email protected]@edu%24stanford%24nlp%[email protected]&k=u - quite literally, there are no references to this class other than its definition). Therefore, I have to break the paragraphs myself.

The key to this is to notice that each sentence contained within the SentencesAnnotation contains a CharacterOffsetBeginAnnotation. My code then becomes something like this:

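    // 'text' is the original document string that was passed to the annotator.
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    List<Paragraph> paragraphs = new ArrayList<> ();
    List<Sentence> convertedSentences = new ArrayList<> ();
    for (CoreMap sentence : sentences)
    {
        int sentenceOffsetStart = sentence.get (CharacterOffsetBeginAnnotation.class);
        // A sentence starts a new paragraph when the two characters before it are "\n\n".
        boolean afterBlankLine = sentenceOffsetStart > 1
            && text.substring (sentenceOffsetStart - 2, sentenceOffsetStart).equals("\n\n");
        if (afterBlankLine && !convertedSentences.isEmpty ())
        {
            // Close the paragraph collected so far and start a new one.
            paragraphs.add (new Paragraph (convertedSentences));
            convertedSentences = new ArrayList<> ();
        }
        convertedSentences.add (new Sentence (sentence));
    }
    // Flush the final paragraph, if any sentences were collected.
    if (!convertedSentences.isEmpty ())
    {
        paragraphs.add (new Paragraph (convertedSentences));
    }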
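
The check above only fires when a sentence is preceded by exactly "\n\n", so it can miss breaks written as "\r\n\r\n" or blank lines that contain stray spaces. As a rough sketch of a more tolerant variant, the grouping can look at the whole gap between the end of one sentence and the start of the next. The names ParagraphGrouper, Span and group below are hypothetical and not part of CoreNLP; Span stands in for a sentence CoreMap, with begin and end assumed to come from CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation. It also assumes the sentences arrive in document order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: groups sentence spans into paragraphs at blank lines. */
final class ParagraphGrouper {

    /** Stand-in for a sentence: begin (inclusive) and end (exclusive) character offsets. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Two newlines separated only by spaces, tabs or carriage returns.
    private static final Pattern BLANK_LINE = Pattern.compile("\\n[ \\t\\r]*\\n");

    /** Assumes the sentences are sorted by offset and do not overlap. */
    static List<List<Span>> group(String text, List<Span> sentences) {
        List<List<Span>> paragraphs = new ArrayList<>();
        List<Span> current = new ArrayList<>();
        int previousEnd = 0;
        for (Span sentence : sentences) {
            // Text between the previous sentence and this one.
            int gapStart = Math.min(previousEnd, sentence.begin);
            String gap = text.substring(gapStart, sentence.begin);
            if (!current.isEmpty() && BLANK_LINE.matcher(gap).find()) {
                // A blank line separates the sentences: close the paragraph.
                paragraphs.add(current);
                current = new ArrayList<>();
            }
            current.add(sentence);
            previousEnd = sentence.end;
        }
        if (!current.isEmpty()) {
            paragraphs.add(current); // flush the final paragraph
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "One. Two.\n\nThree.\r\n\r\nFour.";
        List<Span> sentences = Arrays.asList(
                new Span(0, 4), new Span(5, 9), new Span(11, 17), new Span(21, 26));
        List<List<Span>> paragraphs = group(text, sentences);
        for (int i = 0; i < paragraphs.size(); i++) {
            System.out.println("paragraph " + i + ": "
                    + paragraphs.get(i).size() + " sentence(s)");
        }
    }
}
```

In the CoreNLP loop, the same idea would mean tracking the previous sentence's end offset and testing the substring up to the current sentence's begin offset, rather than only the two characters before it.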
answered Oct 21 '25 23:10 by Jules


