
SyntaxNet creating tree to root verb

I am new to Python and the world of NLP. The recent announcement of Google's SyntaxNet intrigued me. However, I am having a lot of trouble understanding the documentation around both SyntaxNet and related tools (NLTK, etc.).

My goal: given an input such as "Wilbur kicked the ball", I would like to extract the root verb (kicked) and the object it pertains to ("the ball").

I stumbled across "spacy.io" and this visualization seems to encapsulate what I am trying to accomplish: POS tag a string, and load it into some sort of tree structure so that I can start at the root verb and traverse the sentence.

I played around with syntaxnet/demo.sh and, as suggested in this thread, commented out the last couple of lines to get CoNLL output.

I then loaded this output into a Python script (kludged together myself, probably not correct):

import nltk
from nltk.corpus.reader import ConllCorpusReader

# Column layout guessed at; unused columns are marked 'ignore'
columntypes = ['ignore', 'words', 'ignore', 'ignore', 'pos']
corp = ConllCorpusReader('/Users/dgourlay/development/nlp', 'input.conll', columntypes)

I see that I have access to corp.tagged_words(), but no relationship between the words. Now I am stuck! How can I load this corpus into a tree type structure?

Any help is much appreciated!

Derek Gourlay, asked May 17 '16


2 Answers

This may have been better as a comment, but I don't yet have the required reputation.

I haven't used the ConllCorpusReader before (would you consider uploading the file you are loading to a gist and providing a link? It would be much easier to test), but I wrote a blog post which may help with the tree-parsing aspect: here.

In particular, you probably want to chunk each sentence. Chapter 7 of the NLTK book has some more information on this, but this is the example from my blog:

# This grammar is described in the paper by S. N. Kim,
# T. Baldwin, and M.-Y. Kan.
# Evaluating n-gram based evaluation metrics for automatic
# keyphrase extraction.
# Technical report, University of Melbourne, Melbourne 2010.
grammar = r"""
NBAR:
  # Nouns and Adjectives, terminated with Nouns
  {<NN.*|JJ>*<NN.*>}

NP:
  {<NBAR>}
  # Above, connected with in/of/etc...
  {<NBAR><IN><NBAR>}
"""

# postoks is a list of (word, tag) pairs, e.g. from nltk.pos_tag()
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)

Note: You could also use a Context Free Grammar (covered in Chapter 8).
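For illustration, here is a minimal sketch of the CFG route. The toy grammar below is hand-written to cover only the example sentence, so treat it as a starting point rather than a usable grammar:

```python
import nltk

# A toy CFG covering just "Wilbur kicked the ball";
# a real grammar would need far broader coverage.
cfg = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Wilbur' | Det N
Det -> 'the'
N -> 'ball'
V -> 'kicked'
""")

parser = nltk.ChartParser(cfg)
for tree in parser.parse('Wilbur kicked the ball'.split()):
    # Prints the full constituency tree; the VP subtree holds
    # the verb and its object NP.
    print(tree)
```

`ChartParser` enumerates every parse the grammar licenses; with a broad-coverage CFG you would typically get several trees per sentence and need to pick among them.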

Each chunk (in this example each noun phrase, according to the grammar above) will be a subtree. To access these subtrees, we can use this function:

def leaves(tree):
  """Finds NP (noun phrase) leaf nodes of a chunk tree."""
  # NLTK 3 uses t.label(); very old versions used t.node
  for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    yield subtree.leaves()

Each of the yielded objects will be a list of word-tag pairs. From there you can find the verb.

Next, you could play with the grammar above or the parser. Verbs split noun phrases (see this diagram in Chapter 7), so you can probably just access the first NP after a VBD.
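Putting those pieces together, here is a rough end-to-end sketch for the asker's sentence. The sentence is hand-tagged to keep the example self-contained (`nltk.pos_tag` would produce the same pairs), and the scan for "first NP after the verb" is the simple heuristic described above, not a general solution:

```python
import nltk

grammar = r"""
NBAR:
  {<NN.*|JJ>*<NN.*>}

NP:
  {<NBAR>}
  {<NBAR><IN><NBAR>}
"""

# Hand-tagged to avoid depending on a tagger model.
postoks = [('Wilbur', 'NNP'), ('kicked', 'VBD'),
           ('the', 'DT'), ('ball', 'NN')]

chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)

# The top level of the chunk tree mixes NP subtrees with
# un-chunked (word, tag) leaves; take the first NP that
# follows a past-tense verb (VBD).
verb, obj = None, None
for node in tree:
    if isinstance(node, nltk.Tree) and node.label() == 'NP':
        if verb is not None and obj is None:
            obj = [word for word, tag in node.leaves()]
    elif node[1] == 'VBD':
        verb = node[0]

print(verb, obj)  # -> kicked ['ball']
```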

Sorry for the solution not being specific to your problem, but hopefully it is a helpful starting point. If you upload the file(s) I'll take another shot :)

Alex Bowe, answered Oct 20 '22


What you are trying to do is find a dependency, namely dobj. I'm not yet familiar enough with SyntaxNet/Parsey to tell you exactly how to extract that dependency from its output, but I believe this answer might help you. In short, you can configure Parsey to produce CoNLL-format output, parse it into whatever structure you find easy to traverse, then look for the ROOT dependency to find the verb and *obj dependencies to find its objects.
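As a rough sketch of that last step, the snippet below walks one sentence of CoNLL output. It assumes CoNLL-X column layout (FORM in column 2, HEAD in column 7, DEPREL in column 8) and a hard-coded sample parse of the example sentence; verify the layout and labels against your parser's actual output:

```python
# One sentence of CoNLL-style output, hard-coded for illustration.
conll = """\
1\tWilbur\t_\tNOUN\tNNP\t_\t2\tnsubj\t_\t_
2\tkicked\t_\tVERB\tVBD\t_\t0\tROOT\t_\t_
3\tthe\t_\tDET\tDT\t_\t4\tdet\t_\t_
4\tball\t_\tNOUN\tNN\t_\t2\tdobj\t_\t_
"""

rows = [line.split('\t') for line in conll.splitlines()]
words = {int(r[0]): r[1] for r in rows}   # token id -> surface form
heads = {int(r[0]): int(r[6]) for r in rows}  # token id -> head id
rels  = {int(r[0]): r[7] for r in rows}   # token id -> dependency label

# The ROOT token has head id 0; its *obj dependents are the objects.
root_id = next(i for i, h in heads.items() if h == 0)
objs = [i for i, r in rels.items() if r.endswith('obj') and heads[i] == root_id]

def phrase(i):
    # Expand an object to a phrase using its direct dependents
    # (enough for this flat example; a full solution would recurse).
    kids = sorted([i] + [j for j, h in heads.items() if h == i])
    return ' '.join(words[j] for j in kids)

print(words[root_id])             # -> kicked
print([phrase(i) for i in objs])  # -> ['the ball']
```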

maga, answered Oct 20 '22