Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scala: Parsing matching token

Tags:

parsing

scala

I'm playing around with a toy HTML parser, to help familiarize myself with Scala's parsing combinators library:

import scala.util.parsing.combinator._ 
sealed abstract class Node                                                                    
case class TextNode(val contents : String)  extends Node                                      
case class Element(                                                                           
  val tag : String,                                                                           
  val attributes : Map[String,Option[String]],                                                
  val children : Seq[Node]                                                                    
)  extends Node                                                                               

object HTML extends RegexParsers {                                                            
  val node: Parser[Node] = text | element                                                     
  val text: Parser[TextNode] = """[^<]+""".r ^^ TextNode                                      
  val label: Parser[String] = """(\w[:\w]*)""".r                                              
  val value : Parser[String] = """("[^"]*"|\w+)""".r                                   
  val attribute : Parser[(String,Option[String])] = label ~ (                                 
        "=" ~> value ^^ Some[String] | "" ^^ { case _ => None }                        
      ) ^^ { case (k ~ v) => k -> v }                                                         
  val element: Parser[Element] = (                                                            
    ("<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">" )                                     
      ~ rep(node) ~                                                                           
    ("</" ~> label <~ ">")                                                                    
  ) ^^ {                                                                                      
    case (tag ~ attributes ~ children ~ close) => Element(tag, Map(attributes : _*), children)
  }                                                                                           
}                                                                                             

What I'm realizing I want is some way to make sure my opening and closing tags match.

I think to do that, I need some sort of flatMap combinator ~ Parser[A] => (A => Parser[B]) => Parser[B], so I can use the opening tag to construct the parser for the closing tag. But I don't see anything matching that signature in the library.

What's the proper way to do this?

like image 957
rampion Avatar asked Mar 27 '12 18:03

rampion


3 Answers

You can write a method that takes a tag name and returns a parser for a closing tag with that name:

object HTML extends RegexParsers {                   
  lazy val node: Parser[Node] = text | element
  val text: Parser[TextNode] = """[^<]+""".r ^^ TextNode
  val label: Parser[String] = """(\w[:\w]*)""".r
  val value : Parser[String] = """("[^"]*"|\w+)""".r
  val attribute : Parser[(String, Option[String])] = label ~ (
        "=" ~> value ^^ Some[String] | "" ^^ { case _ => None }
      ) ^^ { case (k ~ v) => k -> v }

  val openTag: Parser[String ~ Seq[(String, Option[String])]] =
    "<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">"

  def closeTag(name: String): Parser[String] = "</" ~> name <~ ">"

  val element: Parser[Element] = openTag.flatMap {
    case (tag ~ attrs) =>
      rep(node) <~ closeTag(tag) ^^
        (children => Element(tag, attrs.toMap, children))
  }
}

Note that you also need to make node lazy. Now you get a nice clean error message for unmatched tags:

scala> HTML.parse(HTML.element, "<a></b>")
res0: HTML.ParseResult[Element] = 
[1.6] failure: `a' expected but `b' found

<a></b>
     ^

I've been a little more verbose than necessary for the sake of clarity. If you want concision you can skip the openTag and closeTag methods and write element like this, for example:

val element = "<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">" >> {
  case (tag ~ attrs) =>
    rep(node) <~ "</" ~> tag <~ ">" ^^
      (children => Element(tag, attrs.toMap, children))
}

I'm sure more concise versions would be possible, but in my opinion even this is edging toward unreadability.

like image 70
Travis Brown Avatar answered Nov 09 '22 19:11

Travis Brown


There is a flatMap on Parser, and also an equivalent method named into and an operator >>, which might be more convenient aliases (flatMap is still needed when used in for comprehensions). It is indeed a valid way to do what you're looking for.

Alternatively, you can check that the tags match with ^?.

like image 37
Didier Dupont Avatar answered Nov 09 '22 19:11

Didier Dupont


You are looking at the wrong place. It's a normal mistake, though. You want a method Parser[A] => (A => Parser[B]) => Parser[B], but you looked at the docs of Parsers, not Parser.

Look here.

There's a flatMap, also known as into or >>.

like image 24
Daniel C. Sobral Avatar answered Nov 09 '22 21:11

Daniel C. Sobral