Well, in fact I'm pretty sure I'm using the wrong terminology. Here is the problem I want to solve: a parser for the markdown format, or at least a subset of it.
My problem is with the blockquote feature. Each line in a blockquote starts with >; apart from that, the contents have the normal structure of a markdown file.
You can't look at individual lines separately, because you need to separate paragraphs from normal lines; e.g.
> a
> b
is not the same as
> a
>
> b
and things like that (similarly, if a list is blockquoted, you don't want x lists but one list with x elements). A natural and trivial approach is to "take off" the > signs, parse the blockquote on its own, ignoring anything around it, wrap the result in a BlockQuote type constructor, put that in the outer AST, and resume parsing of the original input. That's what pandoc does, if I'm not wrong:
https://hackage.haskell.org/package/pandoc-1.14.0.4/docs/src/Text-Pandoc-Readers-Markdown.html#blockQuote
blockQuote :: MarkdownParser (F Blocks)
blockQuote = do
  raw <- emailBlockQuote
  -- parse the extracted block, which may contain various block elements:
  contents <- parseFromString parseBlocks $ (intercalate "\n" raw) ++ "\n\n"
  return $ B.blockQuote <$> contents
And then:
http://hackage.haskell.org/package/pandoc-1.5.1/docs/src/Text-Pandoc-Shared.html#parseFromString
-- | Parse contents of 'str' using 'parser' and return result.
parseFromString :: GenParser tok st a -> [tok] -> GenParser tok st a
parseFromString parser str = do
  oldPos <- getPosition
  oldInput <- getInput
  setInput str
  result <- parser
  setInput oldInput
  setPosition oldPos
  return result
Now, parseFromString looks quite hacky to me, and besides that it's Parsec, not attoparsec, so I can't use it in my project (attoparsec has no equivalent of getInput/setInput). I'm not sure how I could take the Text captured for the blockquote, parse it, and return the parsing result so that it "fits" within the current parse. It seems impossible?
I've been googling the issue and I think that pipes-parse and conduit could help in that area, although I struggle to find examples, and what I do see looks considerably less nice than "pure" parsec/attoparsec parsers.
Other options for parsing blockquotes would be: rewriting the usual parsers with the leading > in mind, which complicates and duplicates a lot of code; parsing blockquotes line by line and writing some messy "merge" function; or parsing to a first AST that keeps each blockquote's contents as unparsed Text inside a BlockquoteText type constructor, waiting for a transformation pass in which they are parsed separately (roughly as in the sketch below). The latter is not very elegant, but it has the benefit of simplicity, which does count for something.
I would probably go for that, but surely there's a better way?
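To illustrate, the two-pass idea could look roughly like this; Block, Para, and blocks are hypothetical names, just for the sketch:
import Data.Attoparsec.Text (Parser, parseOnly)
import Data.Text (Text)

-- First-pass AST: blockquote bodies kept as raw, '>'-stripped Text.
data Block
  = Para Text
  | BlockquoteText Text  -- not yet parsed
  | Blockquote [Block]   -- what it becomes after the second pass
  deriving Show

-- Second pass: run the top-level block parser over each stored body.
resolveQuotes :: Parser [Block] -> [Block] -> Either String [Block]
resolveQuotes blocks = traverse step
  where
    step (BlockquoteText t) = Blockquote <$> parseOnly blocks t
    step b                  = Right b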
I have asked myself the same question. Why is there no standard combinator for nested parsers like the one you describe? My default mode is to trust the package author, especially when that author also co-wrote "Real World Haskell". If such an obvious capability is missing, perhaps it is by design and I should look for a better way. However, I've managed to convince myself that such a convenient combinator is mostly harmless: it is useful whenever an all-or-nothing parse is appropriate for the inner text.
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text
import qualified Data.Text as T
import Data.Text (Text)
import Control.Applicative
I've divided the required functionality into two parsers. The first, constP, performs an "in place" parse of some given text. It substitutes the constant parser's failure with empty (from Alternative), but otherwise has no side effects.
constP :: Parser a -> Text -> Parser a
constP p t = case parseOnly p t of
  Left _  -> empty
  Right a -> return a
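To see constP in isolation, a hypothetical GHCi session (using double from attoparsec; the exact error message may differ):
*Main> parseOnly (constP double "3.14") "the actual input is ignored"
Right 3.14
*Main> parseOnly (constP double "oops") "the actual input is ignored"
Left "Failed reading: empty"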
The second part comes from parseOf, which performs the constant, inner parse based on the result of the outer parse. The empty alternative here allows a failed parse to return without consuming any input.
parseOf :: Parser Text -> Parser a -> Parser a
parseOf ptxt pa = bothParse <|> empty
  where
    bothParse = ptxt >>= constP pa
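As a usage example unrelated to markdown, here is a sketch of a parser that captures the text between parentheses and re-parses it as a double (parenDouble is a made-up name, just for illustration):
parenDouble :: Parser Double
parenDouble = parseOf (char '(' *> takeTill (== ')') <* char ')')
                      (double <* endOfInput)

-- parseOnly parenDouble "(3.5)" == Right 3.5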
The blockquote markdown parser can now be written in the desired fashion. This implementation requires the extracted block to be parsed in full.
blockQuoteMarkdown :: Parser [[Double]]
blockQuoteMarkdown = parseOf blockQuote (markdownSurrogate <* endOfInput)
Instead of the actual markdown parser, I just implemented a quick parser for space-separated doubles. The complication in this parser comes from allowing the last, non-empty line to end either with or without a newline.
markdownSurrogate :: Parser [[Double]]
markdownSurrogate = do
  lns <- many (mdLine <* endOfLine)
  option lns ((lns ++) . pure <$> mdLine1)
  where
    mdLine  = sepBy double (satisfy (== ' '))
    mdLine1 = sepBy1 double (satisfy (== ' '))
These two parsers are responsible for returning the text internal to block quotes.
blockQuote :: Parser Text
blockQuote = T.unlines <$> many blockLine

blockLine :: Parser Text
blockLine = char '>' *> takeTill isEndOfLine <* endOfLine
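For instance, given those definitions, the outer parse simply peels off the > markers (a hypothetical GHCi check):
*Main> parseOnly blockQuote ">7 3 1\n>4 4 4\n"
Right "7 3 1\n4 4 4\n"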
Finally, a test of the parser.
parseMain :: IO ()
parseMain = do
  putStrLn ""
  doParse "a" markdownSurrogate a
  doParse "_" markdownSurrogate ""
  doParse "b" markdownSurrogate b
  doParse "ab" markdownSurrogate ab
  doParse "a_b" markdownSurrogate a_b
  doParse "badMarkdown x" markdownSurrogate x
  doParse "badMarkdown axb" markdownSurrogate axb
  putStrLn ""
  doParse "BlockQuote ab" blockQuoteMarkdown $ toBlockQuote ab
  doParse "BlockQuote a_b" blockQuoteMarkdown $ toBlockQuote a_b
  doParse "BlockQuote axb" blockQuoteMarkdown $ toBlockQuote axb
  where
    a   = "7 3 1"
    b   = "4 4 4"
    x   = "a b c"
    ab  = T.unlines [a,b]
    a_b = T.unlines [a,"",b]
    axb = T.unlines [a,x,b]
    doParse desc p str = do
      print $ T.concat ["Parsing ",desc,": \"",str,"\""]
      let i = parse (p <* endOfInput) str
      print $ feed i ""
    toBlockQuote = T.unlines
                 . map (T.cons '>')
                 . T.lines
*Main> parseMain
"Parsing a: \"7 3 1\""
Done "" [[7.0,3.0,1.0]]
"Parsing _: \"\""
Done "" []
"Parsing b: \"4 4 4\""
Done "" [[4.0,4.0,4.0]]
"Parsing ab: \"7 3 1\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing a_b: \"7 3 1\n\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing badMarkdown x: \"a b c\""
Fail "a b c" [] "endOfInput"
"Parsing badMarkdown axb: \"7 3 1\na b c\n4 4 4\n\""
Fail "a b c\n4 4 4\n" [] "endOfInput"
"Parsing BlockQuote ab: \">7 3 1\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing BlockQuote a_b: \">7 3 1\n>\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing BlockQuote axb: \">7 3 1\n>a b c\n>4 4 4\n\""
Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"
The notable difference comes in the semantics of failure. For instance, when parsing axb and blockquoted axb, which are the following two strings, respectively:
7 3 1
a b c
4 4 4
and
>7 3 1
>a b c
>4 4 4
the markdown parse results in
Fail "a b c\n4 4 4\n" [] "endOfInput"
whereas the quoted parse results in
Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"
The markdown parse consumes "7 3 1\n", but this is nowhere reported in the quoted failure. Instead, the failure becomes all or nothing.
Likewise, there is no allowance for handling unparsed text in the case of partial success, but I don't see a need for this given the use case. For example, suppose a parse looked something like the following:
"{ <tok> unhandled }more to parse"
where {} denotes the recognized blockquote context and <tok> is what was parsed within that inner context. A partial success would then have to lift "unhandled" out of the blockquote context and somehow combine it with "more to parse". I see no general way to do this, but it can be allowed for through the choice of the inner parser's return type, for instance with some parser parseOf blockP innP :: Parser (<tok>, Maybe Text). However, if this need arises, I would expect there is a better way to handle the situation than with nested parsers.
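To make that concrete, here is a sketch of an inner parser that hands back its unhandled tail instead of failing; markdownWithRest is a made-up name building on markdownSurrogate above:
markdownWithRest :: Parser ([[Double]], Maybe Text)
markdownWithRest = do
  ds   <- markdownSurrogate   -- parse as many well-formed lines as possible
  rest <- takeText            -- capture whatever is left over
  return (ds, if T.null rest then Nothing else Just rest)

-- used as: parseOf blockQuote markdownWithRest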
There may also be concerns about the loss of attoparsec's partial parsing. That is, the implementation of constP uses parseOnly, which collapses the parse results Fail and Partial into a single Left failure state. In other words, we lose the ability to feed the inner parser more text as it becomes available. However, note that the text to parse is itself the result of an outer parse; it will only be available once enough text has been fed to the outer parser. So this shouldn't be an issue either.