
Haskell Parsec: skip all words that aren't predefined

Tags: haskell, parsec

Learning to use the Parsec library, part of homework.

EDIT: Suggestions to use other libraries are welcome, the point is the parsing.

What I want, is to extract all words with a capital letter, and four compass directions from any sentence. Example: "Belgium totally lies south of Holland." should find and return "Belgium south Holland".

What I can't figure out is how to ignore (eat) any input that is -not- a compass direction. I was looking for something along the lines of

'many (not compassDirection >> space)'

but g(h)oogle isn't helping me.

The following code is obviously stuck on the 'many' function.

readExpr :: String -> String
readExpr input = case parse (parseLine) "" input of
    Left err -> "No match: " ++ show err
    Right val -> "Found: " ++ showVal val

parseLine :: Parser GraphValue
parseLine = do
            x <- parseCountry
            space
            many ( some (noneOf " ") >> space )
            y <- parseCompass
            space
            many ( some (noneOf " ") >> space )
            z <- parseCountry
            return $ Direction [x,y,z]

compassDirection :: Parser String
compassDirection = string "north" <|>
                   string "south" <|>
                   string "east" <|>
                   string "west"

parseCountry :: Parser GraphValue
parseCountry = do 
                c <- upper 
                x <- many (lower)
                return $ Country (c:x)

parseCompass :: Parser GraphValue
parseCompass = do 
                x <- compassDirection
                return $ Compass x
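
(The snippet above presumably also relies on Parsec imports, a GraphValue type and a showVal function; they aren't shown in the question, so the following is only a reconstruction of what they might look like.)

import Text.ParserCombinators.Parsec
import Control.Applicative (some)

-- Reconstructed: the question only uses these constructors, it never shows the type.
data GraphValue = Country String
                | Compass String
                | Direction [GraphValue]

-- Reconstructed: a minimal showVal; the real one may differ.
showVal :: GraphValue -> String
showVal (Country s)    = s
showVal (Compass s)    = s
showVal (Direction xs) = unwords (map showVal xs)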
Asked Oct 26 '12 by Taelia

2 Answers

I won't go into specifics since this is homework and the OP said the "important thing is the parsing".


The way I'd solve this problem:

  • tokenize the input. Break it into words; this will free the real parsing step from having to worry about token definitions (i.e. "is %#@[ part of a word?") or whitespace. This could be as simple as words or you could use Parsec for the tokenization. Then you'll have [Token] (or [String] if you prefer).

  • a parser for compass directions. You already have this (good job), but it'll have to be modified a bit if the input is [String] instead of String.

  • a parser for words that start with a capital letter.

  • a parser for everything else, that succeeds whenever it sees a token that isn't a compass direction or a word starting with a capital.

  • a parser that works on any token, but distinguishes between good stuff and bad stuff, perhaps using an algebraic data type.

  • a parser that works on lots of tokens

Hopefully that's clear without being too clear; you'll still have to worry about when to discard the junk, for example. The basic idea is to break the problem down into lots of little sub-problems, solve the sub-problems, then glue those solutions together.
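
For concreteness, here is a minimal non-Parsec sketch of that pipeline (tokenize with words, classify each token, drop the junk); the names Token, classify, keep and extract are mine, not part of the answer, and punctuation handling is deliberately ignored:

import Data.Char (isUpper)

-- One possible token classification; the answer leaves the exact shape to you.
data Token = CompassTok String      -- north / south / east / west
           | CapitalisedTok String  -- word starting with a capital letter
           | JunkTok String         -- everything else
           deriving Show

classify :: String -> Token
classify w
  | w `elem` ["north", "south", "east", "west"] = CompassTok w
  | not (null w) && isUpper (head w)            = CapitalisedTok w
  | otherwise                                   = JunkTok w

keep :: Token -> Bool
keep (JunkTok _) = False
keep _           = True

-- Tokenize with words, classify each token, then discard the junk.
-- Note that punctuation stays attached to words here (e.g. "Holland.").
extract :: String -> [Token]
extract = filter keep . map classify . words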

Answered Nov 15 '22 by Matt Fenwick


I'm going to tell you how I would start and then give some advice about how I'd carry on.

I'd base this on an algebraic data type - as you add extra words you can classify them more closely:

data Word = Country String | Direction NESW | Unclassified String
data NESW = North | East | South | West

so my answer to how you skip words you don't know about is that you don't need to - leave them as unclassified.

The applicative style is nicer than the monadic style.

I think compassDirection should allow capitals:

compassDirection :: Parser NESW
compassDirection = north <|> south <|> east <|> west where
    north = North <$ (string "north" <|> string "North")
    east = ...

You can define country using Country <$> ((:) <$> upper <*> many lower)

Then you can have a catch-all Unclassified <$> many letter.

Your word parser can currently be

word = compassDirection <|> country <|> unclassified

but notice that compassDirection has to come before country because otherwise country would match North.
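
Glued together, the word parser might look like the sketch below; compassDirection returns a NESW, so it has to be lifted into Word with the Direction constructor, and I've added try (plus many1 for the catch-all) so a partial match like "nor..." can fall back to the other branches. (On a recent GHC you'd also want to rename Word or hide Prelude's Word.)

-- Sketch only: combines the fragments from this answer into one parser.
word :: Parser Word
word = try (Direction <$> compassDirection)   -- try: back out of partial matches
   <|> country
   <|> unclassified
  where
    country      = Country      <$> ((:) <$> upper <*> many lower)
    unclassified = Unclassified <$> many1 letter   -- many1: never succeed on empty input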

You can do

words = word `sepBy1` space

at the moment that's OK, but you must must must not use word or words when you're analysing sentences more properly, because you lose control over what the word is. At that point you'd need noun, adjective, nounPhrase, verb, adjectivalPhrase etc. to parse a sentence structure. Sentences that you can't parse mean you need to add new constructs to your grammar.

It's worth making word parsers swallow the whitespace after them (or before), or refactoring using a preprocessor that splits words from spaces and punctuation. Consider having a fullStop parser if you're British or a period parser if you're American. Use it when you create a sentence parser.
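
A sketch of the whitespace-swallowing version, using the conventional lexeme name (not something the answer defines):

-- Wrap a parser so it also consumes any whitespace that follows it.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- A full stop / period at the end of a sentence.
fullStop :: Parser Char
fullStop = char '.'

sentence :: Parser [Word]
sentence = many1 (lexeme word) <* fullStop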

Using applicative and higher order functions will make it much clearer to write your grammar, because you'll not have it cluttered up with monadic notation, and it will look like sentences. Example: you could do nvn = NVN <$> noun <*> verb <*> noun if you want to use a primarily abstract syntax tree (AST) approach with one constructor per grammatical object. If you just prefer to have a bunch of words lying around all of the same type, you can do nvn = sequence [noun,verb,noun].
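
To make that contrast concrete (NVN, noun and verb are illustrative placeholders here, not parsers the answer defines):

data Phrase = NVN Word Word Word   -- one constructor per grammatical shape

-- Placeholder category parsers; real ones would consult a lexicon.
noun, verb :: Parser Word
noun = word
verb = word

-- AST style: the grammatical structure is reflected in the result type.
nvnTyped :: Parser Phrase
nvnTyped = NVN <$> noun <*> verb <*> noun

-- Flat style: just a list of words, all of the same type.
nvnFlat :: Parser [Word]
nvnFlat = sequence [noun, verb, noun]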

Most computer languages are parsed using an AST approach, but I've no direct experience of natural language parsing beyond what I picked up second-hand from my wife's linguistics degree.

If you sit down and write how you can combine categories of words, phrases, clauses and sentences together, you'll be able to write the parser fairly quickly.

Answered Nov 15 '22 by AndrewC