Match alphanumeric string in nltk grammar

Question

I'm trying to use NLTK's grammar and parsing algorithms, as they seem pretty simple to use. However, I can't find a way to match an alphanumeric string properly. Something like:

import nltk
grammar = nltk.parse_cfg ("""
# Is this possible?
TEXT -> \w*  
""")

parser = nltk.RecursiveDescentParser(grammar)

print parser.parse("foo")

Is there an easy way to achieve this?

Jonathan Villemaire-Krajden · Accepted Answer

It would be very difficult to do cleanly. The base parser classes rely on exact matches against the production RHS to pop content, so it would require subclassing and rewriting large parts of the parser class. I attempted it a while ago with the feature grammar class and gave up.

What I did instead is more of a hack: I extract the regex matches from the text first and add them to the grammar as productions. This will be very slow with a large grammar, since the grammar and parser have to be rebuilt on every call.

import re

import nltk
from nltk.grammar import Nonterminal, Production, ContextFreeGrammar

grammar = nltk.parse_cfg ("""
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")

productions = grammar.productions()

def literal_production(key, rhs):
    """ Return a production <key> -> <rhs>

    :param key: symbol for the lhs
    :param rhs: string literal
    """
    lhs = Nonterminal(key)
    return Production(lhs, [rhs])

def parse(text):
    """ Parse some text.
    """

    # extract new words and numbers
    words = set([match.group(0) for match in re.finditer(r"[a-zA-Z]+", text)])
    numbers = set([match.group(0) for match in re.finditer(r"\d+", text)])

    # Make a local copy of productions
    lproductions = list(productions)

    # Add a production for every word and number
    lproductions.extend([literal_production("WORD", word) for word in words])
    lproductions.extend([literal_production("NUMBER", number) for number in numbers])

    # Make a local copy of the grammar with extra productions
    lgrammar = ContextFreeGrammar(grammar.start(), lproductions)

    # Load grammar into a parser
    parser = nltk.RecursiveDescentParser(lgrammar)

    tokens = text.split()

    return parser.parse(tokens)

print parse("foo hello world 123 foo")
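
For readers on NLTK 3, where `nltk.parse_cfg` and `ContextFreeGrammar` were replaced by `CFG.fromstring` and `CFG`, here is a rough sketch of the same hack. It generates the extra rules as grammar text instead of building `Production` objects. The names `BASE_RULES`, `grammar_text` and `parse` are only illustrative (they are not part of NLTK), and the `parse` step assumes NLTK 3.x is installed.

```python
import re

# Same base grammar as above; WORD and NUMBER rules are added per input.
BASE_RULES = """
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
"""

def grammar_text(text):
    """Return the base grammar plus one literal rule per word/number in text."""
    words = sorted(set(re.findall(r"[a-zA-Z]+", text)))
    numbers = sorted(set(re.findall(r"\d+", text)))
    rules = [BASE_RULES.strip()]
    rules += ["WORD -> '%s'" % word for word in words]
    rules += ["NUMBER -> '%s'" % number for number in numbers]
    return "\n".join(rules)

def parse(text):
    """Rebuild the grammar and parser for this input, then parse it."""
    import nltk  # assumption: NLTK 3.x

    grammar = nltk.CFG.fromstring(grammar_text(text))
    parser = nltk.RecursiveDescentParser(grammar)
    # In NLTK 3, parse() yields trees lazily, so collect them into a list.
    return list(parser.parse(text.split()))

if __name__ == "__main__":
    print(grammar_text("foo hello world 123 foo"))
```

As in the original, the grammar is rebuilt on every call, so the same performance caveat applies. Tokens that mix letters and digits (such as `foo123`) still would not be covered, because the regexes extract words and numbers separately while the input is split on whitespace.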

For more background, see the discussion of this on the nltk-users Google Group: https://groups.google.com/d/topic/nltk-users/4nC6J7DJcOc/discussion

Tags: python, parsing, nltk, finiteautomata