I'm trying to parse some information out of largely free-form text. I attempted an implementation in FParsec, but I haven't used it before and I'm not sure if I'm doing it wrong, or even if it is well-suited to this particular problem.
I want to parse out the contents of a particular set of Liquid tags from a markdown document ("examplecode" and "requiredcode" tags). The markdown will be mainly free-form text with the occasional block within Liquid tags, for example:
Some free form text.
Possibly lots of lines. Maybe `code` stuff.
{% examplecode opt-lang-tag %}
ABC
DEF
{% endexamplecode %}
More text. Possibly multilines.
{% othertag %}
can ignore this tag
{% endothertag %}
{% requiredcode %}
GHI
{% endrequiredcode %}
In this case I need to parse out [ "ABC\nDEF"; "GHI" ]
.
The parsing logic I'm after can be expressed imperatively. Loop through each line, if we find a start tag we're interested in, take lines until we match the closing tag and add those lines to the list of results, otherwise skip lines until the next start tag. Repeat.
This can be done with a loop or fold, or with a regular expression:
\{%\s*(examplecode|requiredcode).*\%}(.*?)\{%\s*end\1\s*%\}
I found it difficult to express the logic above in FParsec. I wanted to write something like between s t (everythingUntil t)
, but I don't know how to implement that without everythingUntil
consuming the end token, causing between
to fail.
I ended up with the following, which doesn't handle nested occurrences of "{%"
, but seems to pass the main test cases I care about:
let trimStr (s : string) = s.Trim()
let betweenStr s t = between (pstring s) (pstring t)
let allTill s = charsTillString s false maxInt
let skipAllTill s = skipCharsTillString s false maxInt
let word : Parser<string, unit> = many1Satisfy (not << Char.IsWhiteSpace)
type LiquidTag = private LiquidTag of name : string * contents : string
let makeTag n c = LiquidTag (n, trimStr c)
let liquidTag =
let pStartTag = betweenStr "{%" "%}" (spaces >>. word .>> spaces .>> skipAllTill "%}")
let pEndTag tagName = betweenStr "{%" "%}" (spaces >>. pstring ("end" + tagName) .>> spaces)
let tagContents = allTill "{%"
pStartTag >>= fun name ->
tagContents
.>> pEndTag name
|>> makeTag name
let tags = many (skipAllTill "{%" >>. liquidTag)
I can then filter tags to only include the ones I'm interested in.
This does a lot more than a basic implementation (like a regex) does, such as descriptive error reporting and more strict validation of input format (this is both good and bad).
One consequence of the stricter format is parsing fails on nested "{%"
substrings within tags. I'm not sure how I'd adjust it to handle this case (should give [ "ABC {% DEF " ]
):
{% examplecode %}
ABC {% DEF
{% endexamplecode %}
Is there a way to more closely express the logic described in the "Problem description" section in FParsec, or does the free-form nature of the input make FParsec less suited to this than a more basic loop or regex?
(I'm also interested in ways to allow nested "{%"
strings in tags, and improvements to my FParsec attempt. I'm happy to split that out into other questions as required.)
I'd just use start >>. everythingUntil end
instead of between start end body
.
The following implementation is relatively close to the logic in the regex:
let maxInt = System.Int32.MaxValue
type LiquidTag = LiquidTag of string * string
let skipTillString str = skipCharsTillString str true maxInt
let skipTillStringOrEof str : Parser<unit, _> =
fun stream ->
let mutable found = false
stream.SkipCharsOrNewlinesUntilString(str, maxInt, &found) |> ignore
Reply(())
let openingBrace = skipString "{%" >>. spaces
let tagName name =
skipString name
>>? nextCharSatisfies (fun c -> c = '%' || System.Char.IsWhiteSpace(c))
let endTag name =
openingBrace >>? (tagName ("end" + name) >>. (spaces >>. skipString "%}"))
let tagPair_afterOpeningBrace name =
tagName name >>. skipTillString "%}"
>>. (manyCharsTill anyChar (endTag name)
|>> fun str -> LiquidTag(name, str))
let skipToOpeningBraceOrEof = skipTillStringOrEof "{%"
let tagPairs =
skipToOpeningBraceOrEof
>>. many (openingBrace
>>. opt ( tagPair_afterOpeningBrace "examplecode"
<|> tagPair_afterOpeningBrace "requiredcode")
.>> skipToOpeningBraceOrEof)
|>> List.choose id
.>> eof
Some notes:
I only parse the two Liquid statements you're interested in. This makes a difference if one of these statements is nested inside a statement you're not interested in. It also has the advantage that no parsers have to be constructed while the parser is running.
I'm using the >>?
combinator to control when exactly backtracking
may occur.
The performance of this implementation will not be great, but there are various ways to optimize it if necessary. The slowest component will probably be the manyCharsTill anyChar (endTag name)
parser, which could be easily replaced with a custom primitive. The many ... |> List.choose id
in tagPairs
could also be easily replaced with a more efficient custom combinator.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With