Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse parenthetical trees in python?

I need help with this developing this algorithm that I'm working on. I have a an input of a tree in the following format:

( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )

This looks this the following tree.

                   Root
                     |
                ____________
              AB           CD
              |             |  
       __________         ___________
      ABC      CBA        CDE      FGH

What the algorithm is suppose to is read the parenthetical format in and give the following output:

Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH

It list the root and its children and all other parents that have children. I am not able to understand how to start up on this, Can someone help me gimme hint or give some references or links?

like image 525
Hell Man Avatar asked Nov 03 '13 04:11

Hell Man


People also ask

How do you parse a tree in Python?

-80.435259246021 -23.831876011253 (S1 (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down))) (. .)))

What does parse () do in Python?

In this article, parsing is defined as the processing of a piece of python program and converting these codes into machine language. In general, we can say parse is a command for dividing the given program code into a small piece of code for analyzing the correct syntax.


3 Answers

Solution: the Tree class from module nltk

(aka Natural Language Toolkit)

Making the actual parsing

This is your input:

input = '( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )'

And you parse it very simply by doing:

from nltk import Tree
t = Tree.fromstring(input)

Playing with the parsed tree

>>> t.label()
'Root'
>>> len(t)
2
>>> t[0]
Tree('AB', [Tree('ABC', []), Tree('CBA', [])])
>>> t[1]
Tree('CD', [Tree('CDE', []), Tree('FGH', [])])
>>> t[0][0]
Tree('ABC', [])
>>> t[0][1]
Tree('CBA', [])
>>> t[1][0]
Tree('CDE', [])
>>> t[1][1]
Tree('FGH', [])

As you seen, you can treat each node as a list of subtrees.

To pretty-print the tree

>>> t.pretty_print()
            Root            
      _______|________       
     AB               CD    
  ___|___          ___|___   
ABC     CBA      CDE     FGH
 |       |        |       |  
...     ...      ...     ...

To obtain the output you want

from sys import stdout

def showtree(t):
    if (len(t) == 0):
        return
    stdout.write(t.label() + ' ->')
    for c in t:
        stdout.write(' ' + c.label())
    stdout.write('\n')
    for c in t:
        showtree(c)

Usage:

>>> showtree(t)
Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH

To install the module

pip install nltk

(Use sudo if required)

like image 176
Kubuntuer82 Avatar answered Oct 19 '22 06:10

Kubuntuer82


A recursive descent parser is a simple form of parser that can parse many grammars. While the entire theory of parsing is too large for a stack-overflow answer, the most common approach to parsing involves two steps: first, tokenisation, which extracts subwords of your string (here, probably words like 'Root', and 'ABC', or brackets like '(' and ')'), and then parsing using recursive functions.

This code parses input (like your example), producing a so-called parse tree, and also has a function 'show_children' which takes the parse tree, and produces the children view of the expression as your question asked.

import re

class ParseError(Exception):
    pass

# Tokenize a string.
# Tokens yielded are of the form (type, string)
# Possible values for 'type' are '(', ')' and 'WORD'
def tokenize(s):
    toks = re.compile(' +|[A-Za-z]+|[()]')
    for match in toks.finditer(s):
        s = match.group(0)
        if s[0] == ' ':
            continue
        if s[0] in '()':
            yield (s, s)
        else:
            yield ('WORD', s)


# Parse once we're inside an opening bracket.
def parse_inner(toks):
    ty, name = next(toks)
    if ty != 'WORD': raise ParseError
    children = []
    while True:
        ty, s = next(toks)
        if ty == '(':
            children.append(parse_inner(toks))
        elif ty == ')':
            return (name, children)

# Parse this grammar:
# ROOT ::= '(' INNER
# INNER ::= WORD ROOT* ')'
# WORD ::= [A-Za-z]+
def parse_root(toks):
    ty, _ = next(toks)
    if ty != '(': raise ParseError
    return parse_inner(toks)

def show_children(tree):
    name, children = tree
    if not children: return
    print '%s -> %s' % (name, ' '.join(child[0] for child in children))
    for child in children:
        show_children(child)

example = '( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )'
show_children(parse_root(tokenize(example)))
like image 20
Paul Hankin Avatar answered Oct 19 '22 07:10

Paul Hankin


Try this :

def toTree(expression):
    tree = dict()
    msg =""
    stack = list()
    for char in expression:
        if(char == '('):
            stack.append(msg)
            msg = ""
        elif char == ')':
            parent = stack.pop()
            if parent not in tree:
                tree[parent] = list()
            tree[parent].append(msg)
            msg = parent
        else:
            msg += char
    return tree

expression = "(Root(AB(ABC)(CBA))(CD(CDE)(FGH)))"
print toTree(expression)

It returns a dictionary, where the root can be accessed with the key ''. You can then do a simple BFS to print the output.

OUTPUT :

{
''    : ['Root'], 
'AB'  : ['ABC', 'CBA'], 
'Root': ['AB', 'CD'], 
'CD'  : ['CDE', 'FGH']
}

You will have to eliminate all the whitespaces in the Expression before you start, or ignore the inrrelevant charaters in the expression by adding the following as the very first line in the for-loop :

if char == <IRRELEVANT CHARACTER>:
    continue

The above code will run in O(n) time, where n is the length of the expression.

EDIT

Here is the printing function :

def printTree(tree, node):
    if node not in tree:
        return 
    print '%s -> %s' % (node, ' '.join(child for child in tree[node])) 
    for child in tree[node]:
        printTree(tree, child) 

The desired Output can be achieved by the following :

expression = "(Root(AB(ABC)(CBA))(CD(CDE)(FGH)))"
tree = toTree(expression)
printTree(tree, tree[''][0])

Output

Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH

EDIT

Assuming the node names are not unique, we just have to give new names to the nodes. This can be done using :

def parseExpression(expression):
    nodeMap = dict()
    counter = 1
    node = ""
    retExp =""
    for char in expression:
        if char == '(' or char == ')' :
            if (len(node) > 0):
                nodeMap[str(counter)] = node;
                retExp += str(counter)
                counter +=1
            retExp += char
            node =""
        elif char == ' ': continue
        else :
            node += char
    return retExp,nodeMap

The print Function will now change to :

def printTree(tree, node, nodeMap):
    if node not in tree:
        return 
    print '%s -> %s' % (nodeMap[node], ' '.join(nodeMap[child] for child in tree[node])) 
    for child in tree[node]:
        printTree(tree, child, nodeMap)

The output can be obtained by using :

expression = " ( Root( SQ ( VBZ ) ( NP ( DT ) ( NN ) ) ( VP ( VB ) ( NP ( NN ) ) ) ))"
expression, nodeMap = parseExpression(expression)
tree = toTree(expression)
printTree(tree, tree[''][0], nodeMap)

Output :

Root -> SQ
SQ -> VBZ NP VP
NP -> DT NN
VP -> VB NP
NP -> NN
like image 32
Kyuubi Avatar answered Oct 19 '22 06:10

Kyuubi