I'm trying to separate a string by either ","
, ", and"
and "and"
, and then return whatever was in between. An example of what I have so far is as follows:
import Data.Attoparsec.Text
sepTestParser = nameSep ((takeWhile1 $ inClass "-'a-zA-Z") <* space)
nameSep p = p `sepBy` (string " and " <|> string ", and" <|> ", ")
main = do
print $ parseOnly sepTestParser "This test and that test, this test particularly."
I would like the output to be ["This test", "that test", "this test particularly."]
. I have a vague sense that what I'm doing is wrong, but I can't quite work out why.
Note: This answer is written in literate Haskell. Save it as Example.lhs
and load it in GHCi or similar.
The thing is, sepBy
is implemented as:
sepBy p s = liftA2 (:) p ((s *> sepBy1 p s) <|> pure []) <|> pure []
This means that the second parser s
will be called after the first parser has succeeded. This also means, that if you were to add whitespace to the class of characters, that, you would end up with
["This test and that test","this test particularly"]
since and
is now parseable by p
. This isn't easy to fix: you would need to look ahead as soon as you hit a space, and check if after an arbitrarily number of spaces an "and" follows, and if it does, stop parsing. Only then a parser written with sepBy
will work.
So lets write a parser that takes words instead (the rest of this answer is literate Haskell):
> {-# LANGUAGE OverloadedStrings #-}
> import Control.Applicative
> import Data.Attoparsec.Text
> import qualified Data.Text as T
> import Control.Monad (mzero)
> word = takeWhile1 . inClass $ "-'a-zA-Z"
>
> wordsP = fmap (T.intercalate " ") $ k `sepBy` many space
> where k = do
> a <- word
> if (a == "and") then mzero
> else return a
wordsP
now takes multiple words until it either hits something, that's not a word, or a word that equals "and". The returned mzero
will indicate a parsing failure, at which another parser can take over:
> andP = many space *> "and" *> many1 space *> pure()
>
> limiter = choice [
> "," *> andP,
> "," *> many1 space *> pure (),
> andP
> ]
limiter
is mostly the same parser you've already written, it's the same as the regex /,\s+and|,\s+|\s*and\s+/
.
Now we can actually use sepBy
, since our first parser doesn't overlap with the second anymore:
> test = "This test and that test, this test particular, and even that test"
>
> main = print $ parseOnly (wordsP `sepBy` limiter) test
The result is ["This test","that test","this test particular","even that test"]
, just as we wanted. Note that this particular parser doesn't preserve whitespace.
So whenever you want to create a parser with sepBy
, make sure that both parsers don't overlap.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With