Comparing two similar, non-identical NLTK trees

I am trying to write a program that takes in two sentences and checks whether they are similar. I didn't want to use a full-fledged parser, so I created one using a simple grammar covering the constructions I expect to encounter most often. My interest is in the noun phrases in the sentences. Checking the subtrees tagged as noun phrases for equality would be easy enough, but I want to go further and let the user decide whether missing or mismatched determiners should be accepted (partial matches).

The output tree is of the form (S (NP The/DT bag/NN) is/VBZ (JP blue/JJ)), where my grammar defines noun phrases (NP) and adjective phrases (JP).

To go about matching, I've considered a few routes:

  • to delete the determiner nodes in the relevant trees and then compare
  • to change the value of all determiner nodes to a common value, say, X
  • to make a list of all leaf nodes except those tagged as 'DT'

I'm new to Python and am facing a few problems here:

  • if I write a recursive function to traverse the noun phrase tree until it reaches a leaf with a determiner, I am unable to modify the value in the original tree, because only the value is passed to the function.

  • the only delete function I found for NLTK trees is one that requires the exact index of the node to be deleted with respect to the root of the tree, in a format such as [0,0] if it's the leftmost child of the leftmost child of the root node. This is tricky to get, as it would most likely involve building, for each node, a list of integers that grows with the height of the tree.

  • I created a list of lists, where each list has all the leaves from one noun phrase excluding the determiners, and compared these.

So, my questions are:

How do I delete a node from an NLTK tree without first obtaining its index in the form [0,0,1,0,...]?

How do I modify a leaf value, again without using an index? (I would like to use a recursive function that modifies a leaf whenever it hits one I want to change.)

If these aren't possible, how can I obtain the index of a leaf? I'm stumped at this. NLTK trees have a treeposition function, but this only works for subtrees. Does Python consider a leaf to be a different type from the other nodes? I ask because treeposition isn't working for my leaves. This might be because my leaves are tuples and not just strings, but I don't know how to change that, because that's the POS tagger's output. So is there some way to replace my leaf, which is a tuple of the form [the/DT], with a subtree of the form (DT the)? Again, the recursive procedures I define don't modify the original tree.

Any suggestions/observations?

SOP asked Oct 20 '22 22:10


1 Answer

Ok, let's tackle your questions one by one.

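from nltk import Tree

# Tree.parse() is the NLTK 2 name; NLTK 3 renamed it to Tree.fromstring().
tree = Tree.fromstring("(S (NP The/DT bag/NN) is/VBZ (JP blue/JJ))")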

Deleting a node. Tree inherits from list, so remove() deletes a direct child by value, with no index needed:

tree.remove(Tree('JP', ['blue/JJ']))

tree.remove('is/VBZ')
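Note that remove() only looks at the direct children of the node it is called on, so it will not find a determiner nested inside an NP. A minimal sketch of one way around that, filtering every subtree in place: the helper names is_determiner and drop_determiners are made up for illustration, and the leaves are assumed to be either "word/TAG" strings (as produced by Tree.fromstring) or (word, tag) tuples (as produced by a POS tagger).

```python
from nltk.tree import Tree


def is_determiner(leaf):
    """Hypothetical helper: True for 'the/DT' strings or ('the', 'DT') tuples."""
    if isinstance(leaf, tuple):
        return leaf[1] == "DT"
    return leaf.endswith("/DT")


def drop_determiners(tree):
    """Remove determiner leaves at any depth, mutating the tree in place."""
    # Snapshot the subtrees first so we never mutate a node mid-iteration.
    for subtree in list(tree.subtrees()):
        # Slice assignment changes the existing node instead of rebinding a name.
        subtree[:] = [
            child for child in subtree
            if isinstance(child, Tree) or not is_determiner(child)
        ]


tree = Tree.fromstring("(S (NP The/DT bag/NN) is/VBZ (JP blue/JJ))")
drop_determiners(tree)
print(tree.leaves())
```

No tree positions are involved: each node is a list, and the filter is applied to that list directly.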

Modifying a value. You could do this by getting the index of a member of the Tree (remember, it inherits from list):

tree.index('is/VBZ')

but this is not a good approach, since index() only searches the direct children of the node it is called on.
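A recursive function can still modify the original tree, as long as it assigns into the parent node (node[i] = ...) rather than rebinding its own local variable. Here is a rough sketch of the "replace every determiner with a common value" route from the question. The function name mask_determiners and the placeholder "X/DT" are illustrative choices, and the leaves are assumed to be "word/TAG" strings.

```python
from nltk.tree import Tree


def mask_determiners(node, placeholder="X/DT"):
    """Replace every determiner leaf under node with a common placeholder."""
    for i, child in enumerate(node):
        if isinstance(child, Tree):
            # The child is the same object that lives in the tree, not a copy.
            mask_determiners(child, placeholder)
        elif child.endswith("/DT"):
            # Item assignment on the parent is what changes the original tree.
            node[i] = placeholder


a = Tree.fromstring("(S (NP The/DT bag/NN) is/VBZ (JP blue/JJ))")
b = Tree.fromstring("(S (NP A/DT bag/NN) is/VBZ (JP blue/JJ))")
mask_determiners(a)
mask_determiners(b)
print(a == b)  # the trees now differ only where they differed in non-determiners
```

Writing `child = placeholder` inside the loop would only rebind the local name, which is probably why the earlier recursive attempts appeared to leave the tree untouched.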

A more robust way to work with the leaves is to get them with tree.leaves(), look up each leaf's position with tree.leaf_treeposition(index), where index is the leaf's position in that list, and use the resulting tuple to modify or delete the leaf in place.
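The leaf_treeposition approach might look roughly like this. It is a sketch that assumes "word/TAG" string leaves; the lowercasing step is only there to show an in-place modification.

```python
from nltk.tree import Tree

tree = Tree.fromstring("(S (NP The/DT bag/NN) is/VBZ (JP blue/JJ))")

# leaf_treeposition(i) returns the tuple path to the i-th leaf.
first = tree.leaf_treeposition(0)
tree[first] = "the/DT"  # modify a leaf in place through its position

# Walk the leaf indices from last to first, so that deleting a leaf
# does not shift the indices of the leaves still to be visited.
for i in reversed(range(len(tree.leaves()))):
    pos = tree.leaf_treeposition(i)
    if tree[pos].endswith("/DT"):
        del tree[pos]  # delete a leaf in place through its position

print(tree.leaves())
```

Tree accepts a tuple position anywhere a list index would go (tree[pos], tree[pos] = value, del tree[pos]), so the position never has to be built by hand.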

Viktor Vojnovski answered Oct 24 '22 01:10