I have a corpus of sentences that were preprocessed by Stanford's CoreNLP systems. One of the things it provides is the sentence's Parse Tree (Constituency-based). While I can understand a parse tree when it's drawn (like a tree), I'm not sure how to read it in this format:
E.g.:
(ROOT
  (FRAG
    (NP (NN sent28))
    (: :)
    (S
      (NP (NNP Rome))
      (VP (VBZ is)
        (PP (IN in)
          (NP
            (NP (NNP Lazio) (NN province))
            (CC and)
            (NP
              (NP (NNP Naples))
              (PP (IN in)
                (NP (NNP Campania))))))))
    (. .)))
The original sentence is:
sent28: Rome is in Lazio province and Naples in Campania .
How am I supposed to read this tree? Alternatively, is there code (in Python) that reads it properly? Thanks.
The constituency parse tree is based on the formalism of context-free grammars. In this type of tree, the sentence is divided into constituents, that is, sub-phrases that belong to a specific category in the grammar.
Constituency parsing is the process of analyzing a sentence by breaking it down into sub-phrases, also known as constituents. Each sub-phrase belongs to a grammatical category such as NP (noun phrase) or VP (verb phrase).
Dependency parsing displays only the relationships between individual words (each word and the word it depends on), while constituency parsing displays how the words group into nested phrases that make up the whole sentence. Dependency parses are often praised for being concise yet informative, but constituency parses are often easier to read and understand.
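To make the bracket notation concrete: every parenthesized group has the form (LABEL child child ...), where each child is either another group or a word. The sketch below is a minimal hand-written reader for that notation, not CoreNLP's or any library's implementation; the function names parse_brackets, words and show are made up for illustration, and it assumes well-formed input with no escaped parentheses.

```python
import re

def parse_brackets(text):
    """Read bracketed tree text into nested (label, children) tuples."""
    # Tokens are "(", ")", or any run of characters without spaces/parens.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        pos += 1
        label = tokens[pos]          # first token after "(" is the category
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    return read_node()

def words(node):
    """Collect the words under a node, left to right."""
    _, children = node
    out = []
    for child in children:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(words(child))
    return out

def show(node, depth=0):
    """Print one constituent per line, indented by nesting depth."""
    label, _ = node
    print("  " * depth + label + ": " + " ".join(words(node)))
    for child in node[1]:
        if not isinstance(child, str):
            show(child, depth + 1)

example = parse_brackets("(S (NP (NNP Rome)) (VP (VBZ is) (PP (IN in) (NP (NNP Lazio)))))")
show(example)
```

Reading the output top to bottom mirrors reading the brackets: each deeper indent is a constituent nested inside the one above it, and the words listed after a label are exactly the words that constituent spans.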
NLTK has a class for reading parse trees: nltk.tree.Tree. The relevant method is called fromstring. You can then iterate over its subtrees, leaves, etc.
As an aside: you might want to remove the bit that says sent28: as it confuses the parser (it's also not part of the sentence). With it in place, you are not getting a full parse tree, but just a sentence fragment (the FRAG node).
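As a rough sketch of the NLTK approach, the snippet below loads the tree from the question with Tree.fromstring and walks it. It assumes NLTK is installed (pip install nltk); the variable names are arbitrary, and the tree text is copied from the question as-is, including the sent28 prefix.

```python
from nltk.tree import Tree

# The CoreNLP output from the question, unchanged.
PARSE = """
(ROOT
  (FRAG
    (NP (NN sent28))
    (: :)
    (S
      (NP (NNP Rome))
      (VP (VBZ is)
        (PP (IN in)
          (NP
            (NP (NNP Lazio) (NN province))
            (CC and)
            (NP
              (NP (NNP Naples))
              (PP (IN in)
                (NP (NNP Campania))))))))
    (. .)))
"""

tree = Tree.fromstring(PARSE)

print(tree.label())    # category of the top node
print(tree.leaves())   # the words, in sentence order
print(tree.pos())      # (word, part-of-speech tag) pairs

# Visit every noun phrase, outermost first, and print the words it spans.
noun_phrases = [
    " ".join(subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
for phrase in noun_phrases:
    print("NP:", phrase)
```

If you would rather see the drawn form you are used to, tree.pretty_print() renders the same tree as ASCII art in the terminal.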