Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Picking blocks out of largely free-form text with FParsec

Tags:

f#

fparsec

I'm trying to parse some information out of largely free-form text. I attempted an implementation in FParsec, but I haven't used it before and I'm not sure if I'm doing it wrong, or even if it is well-suited to this particular problem.

Problem description

I want to parse out the contents of a particular set of Liquid tags from a markdown document ("examplecode" and "requiredcode" tags). The markdown will be mainly free-form text with the occasional block within Liquid tags, for example:

Some free form text.
Possibly lots of lines. Maybe `code` stuff.

{% examplecode opt-lang-tag %}
ABC
DEF
{% endexamplecode %}

More text. Possibly multilines.

{% othertag %}
can ignore this tag
{% endothertag %}

{% requiredcode %}
GHI
{% endrequiredcode %}

In this case I need to parse out [ "ABC\nDEF"; "GHI" ].

The parsing logic I'm after can be expressed imperatively. Loop through each line, if we find a start tag we're interested in, take lines until we match the closing tag and add those lines to the list of results, otherwise skip lines until the next start tag. Repeat.

This can be done with a loop or fold, or with a regular expression:

\{%\s*(examplecode|requiredcode).*\%}(.*?)\{%\s*end\1\s*%\}

My FParsec attempt

I found it difficult to express the logic above in FParsec. I wanted to write something like between s t (everythingUntil t), but I don't know how to implement that without everythingUntil consuming the end token, causing between to fail.

I ended up with the following, which doesn't handle nested occurrences of "{%", but seems to pass the main test cases I care about:

let trimStr (s : string) = s.Trim()
let betweenStr s t = between (pstring s) (pstring t)
let allTill s = charsTillString s false maxInt
let skipAllTill s = skipCharsTillString s false maxInt
let word : Parser<string, unit> = many1Satisfy (not << Char.IsWhiteSpace)

type LiquidTag = private LiquidTag of name : string * contents : string
let makeTag n c = LiquidTag (n, trimStr c)

let liquidTag =
    let pStartTag = betweenStr "{%" "%}" (spaces >>. word .>> spaces .>> skipAllTill "%}")
    let pEndTag tagName = betweenStr "{%" "%}" (spaces >>. pstring ("end" + tagName) .>> spaces)
    let tagContents = allTill "{%"
    pStartTag >>= fun name -> 
                    tagContents 
                        .>> pEndTag name 
                        |>> makeTag name

let tags = many (skipAllTill "{%" >>. liquidTag)

I can then filter tags to only include the ones I'm interested in.

This does a lot more than a basic implementation (like a regex) does, such as descriptive error reporting and more strict validation of input format (this is both good and bad).

One consequence of the stricter format is parsing fails on nested "{%" substrings within tags. I'm not sure how I'd adjust it to handle this case (should give [ "ABC {% DEF " ]):

{% examplecode %}
ABC {% DEF
{% endexamplecode %}

Question

Is there a way to more closely express the logic described in the "Problem description" section in FParsec, or does the free-form nature of the input make FParsec less suited to this than a more basic loop or regex?

(I'm also interested in ways to allow nested "{%" strings in tags, and improvements to my FParsec attempt. I'm happy to split that out into other questions as required.)

like image 313
David Tchepak Avatar asked Feb 20 '14 23:02

David Tchepak


1 Answers

I'd just use start >>. everythingUntil end instead of between start end body.

The following implementation is relatively close to the logic in the regex:

let maxInt = System.Int32.MaxValue    
type LiquidTag = LiquidTag of string * string

let skipTillString str = skipCharsTillString str true maxInt

let skipTillStringOrEof str : Parser<unit, _> =
    fun stream -> 
        let mutable found = false
        stream.SkipCharsOrNewlinesUntilString(str, maxInt, &found) |> ignore
        Reply(())

let openingBrace = skipString "{%" >>. spaces

let tagName name = 
    skipString name 
    >>? nextCharSatisfies (fun c -> c = '%' || System.Char.IsWhiteSpace(c))

let endTag name =     
    openingBrace >>? (tagName ("end" + name) >>. (spaces >>. skipString "%}"))

let tagPair_afterOpeningBrace name = 
   tagName name  >>. skipTillString "%}"
   >>. (manyCharsTill anyChar (endTag name)
        |>> fun str -> LiquidTag(name, str))

let skipToOpeningBraceOrEof = skipTillStringOrEof "{%" 

let tagPairs =
    skipToOpeningBraceOrEof 
    >>. many (openingBrace
              >>. opt (    tagPair_afterOpeningBrace "examplecode"
                       <|> tagPair_afterOpeningBrace "requiredcode")
              .>> skipToOpeningBraceOrEof)
        |>> List.choose id
   .>> eof

Some notes:

  • I only parse the two Liquid statements you're interested in. This makes a difference if one of these statements is nested inside a statement you're not interested in. It also has the advantage that no parsers have to be constructed while the parser is running.

  • I'm using the >>? combinator to control when exactly backtracking may occur.

  • The performance of this implementation will not be great, but there are various ways to optimize it if necessary. The slowest component will probably be the manyCharsTill anyChar (endTag name) parser, which could be easily replaced with a custom primitive. The many ... |> List.choose id in tagPairs could also be easily replaced with a more efficient custom combinator.

like image 190
Stephan Tolksdorf Avatar answered Oct 12 '22 21:10

Stephan Tolksdorf