Just to clarify, this isn't homework. I've been asked for help on this and am unable to do it, so it turned into a personal quest to solve it.
Imagine you have a grammar for an English sentence like this:
S => NP VP | VP
NP => N | Det N | Det Adj N
VB => V | V NP
N => i you bus cake bear
V => hug love destroy am
Det => a the
Adj => pink stylish
I've searched for several hours and really am out of ideas. I found articles talking about shallow parsing, depth-first backtracking and related things, and while I'm familiar with most of them, I still can't apply them to this problem. I tagged Lisp and Haskell because those are the languages I plan to implement this in, but I don't mind if you use other languages in your replies.
I'd appreciate hints, good articles and everything in general.
Here's a working Haskell example. It turns out there's a few tricks to learn before you can make it work! The zeroth thing to do is boilerplate: turn off the dreaded monomorphism restriction, import some libraries, and define some functions that aren't in the libraries (but should be):
{-# LANGUAGE NoMonomorphismRestriction #-}
import Control.Applicative ((<*))
import Control.Monad
import Text.ParserCombinators.Parsec
ensure p x = guard (p x) >> return x
singleToken t = tokenPrim id (\pos _ _ -> incSourceColumn pos 1) (ensure (==t))
anyOf xs = choice (map singleToken xs)
Now that the zeroth thing is done... first, we define a data type for our abstract syntax trees. We can just follow the shape of the grammar here. However, to make it more convenient, I've factored a few of the grammar's rules; in particular, the two rules
NP => N | Det N | Det Adj N
VB => V | V NP
are more conveniently written this way when it comes to actually writing a parser:
NP => N | Det (Adj | empty) N
VB => V (NP | empty)
Any good book on parsing will have a chapter on why this kind of factoring is a good idea. So, the AST type:
data Sentence
= Complex NounPhrase VerbPhrase
| Simple VerbPhrase
data NounPhrase
= Short Noun
| Long Article (Maybe Adjective) Noun
data VerbPhrase
= VerbPhrase Verb (Maybe NounPhrase)
type Noun = String
type Verb = String
type Article = String
type Adjective = String
Then we can make our parser. This one follows the (factored) grammar even more closely! The one wrinkle here is that we always want our parser to consume an entire sentence, so we have to explicitly ask for it to do that by demanding an "eof" -- or end of "file".
s = (liftM2 Complex np vp <|> liftM Simple vp) <* eof
np = liftM Short n <|> liftM3 Long det (optionMaybe adj) n
vp = liftM2 VerbPhrase v (optionMaybe np)
n = anyOf ["i", "you", "bus", "cake", "bear"]
v = anyOf ["hug", "love", "destroy", "am"]
det = anyOf ["a", "the"]
adj = anyOf ["pink", "stylish"]
The last piece is the tokenizer. For this simple application, we'll just tokenize based on whitespace, so the built-in words
function works just fine. Let's try it out! Load the whole file in ghci:
*Main> parse s "stdin" (words "i love the pink cake")
Right (Complex (Short "i") (VerbPhrase "love" (Just (Long "the" (Just "pink") "cake"))))
*Main> parse s "stdin" (words "i love pink cake")
Left "stdin" (line 1, column 3):
unexpected "pink"
expecting end of input
Here, Right
indicates a successful parse, and Left
indicates an error. The "column" number reported in the error is actually the word number where the error occurred, due to the way we're computing source positions in singleToken
.
There are several different approaches for syntactic parsing using a Context-free grammar.
If you want to implement this yourself you could start by familiarizing yourself with parsing algorithms: you can have a look here and here, or if you prefer something on paper the chapter on Syntactic Parsing in Jurafsky&Martin might be a good start.
I know that it is not too difficult to implement a simple syntactic parser in the Prolog programming language. Just google for 'prolog shift reduce parser' or 'prolog scan predict parser'. I don't know Haskell or Lisp, but there might be similarities to prolog, so maybe you can get some ideas from there.
If you don't have to implement the complete parser on your own I'd have a look at the Python NLTK which offers tools for CFG-Parsing. There is a chapter about this in the NLTK book.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With