Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse a list of words according to a simplified grammar?

Just to clarify, this isn't homework. I've been asked for help on this and am unable to do it, so it turned into a personal quest to solve it.

Imagine you have a grammar for an English sentence like this:

S => NP VP | VP
NP => N | Det N | Det Adj N
VB => V | V NP
N => i you bus cake bear
V => hug love destroy am
Det => a the
Adj => pink stylish

I've searched for several hours and really am out of ideas. I found articles talking about shallow parsing, depth-first backtracking and related things, and while I'm familiar with most of them, I still can't apply them to this problem. I tagged Lisp and Haskell because those are the languages I plan to implement this in, but I don't mind if you use other languages in your replies.

I'd appreciate hints, good articles and everything in general.

like image 860
Ben M Avatar asked Oct 18 '11 07:10

Ben M


2 Answers

Here's a working Haskell example. It turns out there's a few tricks to learn before you can make it work! The zeroth thing to do is boilerplate: turn off the dreaded monomorphism restriction, import some libraries, and define some functions that aren't in the libraries (but should be):

{-# LANGUAGE NoMonomorphismRestriction #-}
import Control.Applicative ((<*))
import Control.Monad
import Text.ParserCombinators.Parsec

ensure p x = guard (p x) >> return x
singleToken t = tokenPrim id (\pos _ _ -> incSourceColumn pos 1) (ensure (==t))
anyOf xs = choice (map singleToken xs)

Now that the zeroth thing is done... first, we define a data type for our abstract syntax trees. We can just follow the shape of the grammar here. However, to make it more convenient, I've factored a few of the grammar's rules; in particular, the two rules

NP => N | Det N | Det Adj N
VB => V | V NP

are more conveniently written this way when it comes to actually writing a parser:

NP => N | Det (Adj | empty) N
VB => V (NP | empty)

Any good book on parsing will have a chapter on why this kind of factoring is a good idea. So, the AST type:

data Sentence
    = Complex NounPhrase VerbPhrase
    | Simple VerbPhrase
data NounPhrase
    = Short Noun
    | Long Article (Maybe Adjective) Noun
data VerbPhrase
    = VerbPhrase Verb (Maybe NounPhrase)
type Noun      = String
type Verb      = String
type Article   = String
type Adjective = String

Then we can make our parser. This one follows the (factored) grammar even more closely! The one wrinkle here is that we always want our parser to consume an entire sentence, so we have to explicitly ask for it to do that by demanding an "eof" -- or end of "file".

s   = (liftM2 Complex np vp <|> liftM Simple vp) <* eof
np  = liftM Short n <|> liftM3 Long det (optionMaybe adj) n
vp  = liftM2 VerbPhrase v (optionMaybe np)
n   = anyOf ["i", "you", "bus", "cake", "bear"]
v   = anyOf ["hug", "love", "destroy", "am"]
det = anyOf ["a", "the"]
adj = anyOf ["pink", "stylish"]

The last piece is the tokenizer. For this simple application, we'll just tokenize based on whitespace, so the built-in words function works just fine. Let's try it out! Load the whole file in ghci:

*Main> parse s "stdin" (words "i love the pink cake")
Right (Complex (Short "i") (VerbPhrase "love" (Just (Long "the" (Just "pink") "cake"))))
*Main> parse s "stdin" (words "i love pink cake")
Left "stdin" (line 1, column 3):
unexpected "pink"
expecting end of input

Here, Right indicates a successful parse, and Left indicates an error. The "column" number reported in the error is actually the word number where the error occurred, due to the way we're computing source positions in singleToken.

like image 183
Daniel Wagner Avatar answered Sep 20 '22 05:09

Daniel Wagner


There are several different approaches for syntactic parsing using a Context-free grammar.

If you want to implement this yourself you could start by familiarizing yourself with parsing algorithms: you can have a look here and here, or if you prefer something on paper the chapter on Syntactic Parsing in Jurafsky&Martin might be a good start.

I know that it is not too difficult to implement a simple syntactic parser in the Prolog programming language. Just google for 'prolog shift reduce parser' or 'prolog scan predict parser'. I don't know Haskell or Lisp, but there might be similarities to prolog, so maybe you can get some ideas from there.

If you don't have to implement the complete parser on your own I'd have a look at the Python NLTK which offers tools for CFG-Parsing. There is a chapter about this in the NLTK book.

like image 28
tobigue Avatar answered Sep 21 '22 05:09

tobigue