I'm playing around with a toy HTML parser, to help familiarize myself with Scala's parsing combinators library:
import scala.util.parsing.combinator._
sealed abstract class Node
case class TextNode(val contents : String) extends Node
case class Element(
val tag : String,
val attributes : Map[String,Option[String]],
val children : Seq[Node]
) extends Node
object HTML extends RegexParsers {
val node: Parser[Node] = text | element
val text: Parser[TextNode] = """[^<]+""".r ^^ TextNode
val label: Parser[String] = """(\w[:\w]*)""".r
val value : Parser[String] = """("[^"]*"|\w+)""".r
val attribute : Parser[(String,Option[String])] = label ~ (
"=" ~> value ^^ Some[String] | "" ^^ { case _ => None }
) ^^ { case (k ~ v) => k -> v }
val element: Parser[Element] = (
("<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">" )
~ rep(node) ~
("</" ~> label <~ ">")
) ^^ {
case (tag ~ attributes ~ children ~ close) => Element(tag, Map(attributes : _*), children)
}
}
What I'm realizing I want is some way to make sure my opening and closing tags match.
I think to do that, I need some sort of flatMap
combinator ~ Parser[A] => (A => Parser[B]) => Parser[B]
,
so I can use the opening tag to construct the parser for the closing tag. But I don't see anything matching that signature in the library.
What's the proper way to do this?
You can write a method that takes a tag name and returns a parser for a closing tag with that name:
object HTML extends RegexParsers {
lazy val node: Parser[Node] = text | element
val text: Parser[TextNode] = """[^<]+""".r ^^ TextNode
val label: Parser[String] = """(\w[:\w]*)""".r
val value : Parser[String] = """("[^"]*"|\w+)""".r
val attribute : Parser[(String, Option[String])] = label ~ (
"=" ~> value ^^ Some[String] | "" ^^ { case _ => None }
) ^^ { case (k ~ v) => k -> v }
val openTag: Parser[String ~ Seq[(String, Option[String])]] =
"<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">"
def closeTag(name: String): Parser[String] = "</" ~> name <~ ">"
val element: Parser[Element] = openTag.flatMap {
case (tag ~ attrs) =>
rep(node) <~ closeTag(tag) ^^
(children => Element(tag, attrs.toMap, children))
}
}
Note that you also need to make node
lazy. Now you get a nice clean error message for unmatched tags:
scala> HTML.parse(HTML.element, "<a></b>")
res0: HTML.ParseResult[Element] =
[1.6] failure: `a' expected but `b' found
<a></b>
^
I've been a little more verbose than necessary for the sake of clarity. If you want concision you can skip the openTag
and closeTag
methods and write element
like this, for example:
val element = "<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">" >> {
case (tag ~ attrs) =>
rep(node) <~ "</" ~> tag <~ ">" ^^
(children => Element(tag, attrs.toMap, children))
}
I'm sure more concise versions would be possible, but in my opinion even this is edging toward unreadability.
There is a flatMap
on Parser, and also an equivalent method named into
and an operator >>
, which might be more convenient aliases (flatMap
is still needed when used in for comprehensions). It is indeed a valid way to do what you're looking for.
Alternatively, you can check that the tags match with ^?
.
You are looking at the wrong place. It's a normal mistake, though. You want a method Parser[A] => (A => Parser[B]) => Parser[B]
, but you looked at the docs of Parsers
, not Parser
.
Look here.
There's a flatMap
, also known as into
or >>
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With