
Match alphanumeric string in nltk grammar

I'm trying to use the NLTK grammar and parsing algorithms, as they seem pretty simple to use. However, I can't find a way to match an alphanumeric string properly, something like:

import nltk
grammar = nltk.parse_cfg ("""
# Is this possible?
TEXT -> \w*  
""")

parser = nltk.RecursiveDescentParser(grammar)

print parser.parse("foo")

Is there an easy way to achieve this?

finiteautomata asked Oct 21 '22 11:10



1 Answer

It would be very difficult to do cleanly. The base parser classes rely on exact matches against the terminals on a production's RHS to consume tokens, so supporting patterns would require subclassing and rewriting large parts of the parser class. I attempted it a while ago with the feature grammar class and gave up.

What I did instead is more of a hack: I extract the regex matches from the text first and add them to the grammar as productions. This will be very slow with a large grammar, since the grammar and parser have to be rebuilt on every call.

import re

import nltk
from nltk.grammar import Nonterminal, Production, ContextFreeGrammar

grammar = nltk.parse_cfg ("""
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")

productions = grammar.productions()

def literal_production(key, rhs):
    """Return the production <key> -> <rhs>.

    :param key: symbol name for the left-hand side
    :param rhs: string literal for the right-hand side
    """
    lhs = Nonterminal(key)
    return Production(lhs, [rhs])

def parse(text):
    """Parse some text."""

    # extract new words and numbers
    words = set([match.group(0) for match in re.finditer(r"[a-zA-Z]+", text)])
    numbers = set([match.group(0) for match in re.finditer(r"\d+", text)])

    # Make a local copy of productions
    lproductions = list(productions)

    # Add a production for every word and number
    lproductions.extend([literal_production("WORD", word) for word in words])
    lproductions.extend([literal_production("NUMBER", number) for number in numbers])

    # Make a local copy of the grammar with extra productions
    lgrammar = ContextFreeGrammar(grammar.start(), lproductions)

    # Load grammar into a parser
    parser = nltk.RecursiveDescentParser(lgrammar)

    tokens = text.split()

    return parser.parse(tokens)

print parse("foo hello world 123 foo")
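
The extraction step can also be sketched without NLTK at all, by generating the lexical rules as grammar text. This is only an illustration under a few assumptions: it targets Python 3, the names `lexical_rules`, `grammar_text` and `BASE_RULES` are made up for the example, and it classifies whole whitespace-separated tokens with `fullmatch` rather than scanning the raw text, so a token like "foo123" produces no rule instead of a "foo" rule and a "123" rule that the tokenizer would never emit.

```python
import re

# Same patterns as in the answer above, applied to whole tokens.
WORD_RE = re.compile(r"[a-zA-Z]+")
NUMBER_RE = re.compile(r"\d+")

# Fixed part of the grammar (copied from the answer above).
BASE_RULES = """\
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
"""


def lexical_rules(tokens):
    """Return grammar lines such as WORD -> 'foo' | 'hello' for the tokens."""
    # Deduplicate and sort so the generated text is deterministic.
    words = sorted({t for t in tokens if WORD_RE.fullmatch(t)})
    numbers = sorted({t for t in tokens if NUMBER_RE.fullmatch(t)})

    lines = []
    if words:
        lines.append("WORD -> " + " | ".join("'%s'" % w for w in words))
    if numbers:
        lines.append("NUMBER -> " + " | ".join("'%s'" % n for n in numbers))
    return "\n".join(lines)


def grammar_text(text):
    """Return (full grammar text, token list) for one input string."""
    tokens = text.split()
    return BASE_RULES + lexical_rules(tokens), tokens


full_grammar, tokens = grammar_text("foo hello world 123 foo")
print(full_grammar)
```

In recent NLTK releases, where `nltk.parse_cfg` has been replaced by `nltk.CFG.fromstring`, text in this form could presumably be passed to that function and the token list handed to the parser. The grammar still has to be rebuilt for each new input, so the performance caveat above applies here as well.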

For more background, this was discussed on the nltk-users Google group: https://groups.google.com/d/topic/nltk-users/4nC6J7DJcOc/discussion

Jonathan Villemaire-Krajden answered Oct 24 '22 11:10
