Understanding the tilde in Scala's parser combinators

Question

I'm fairly new to Scala and while reading about parser combinators(The Magic Behind Parser Combinators, Domain-Specific Languages in Scala) I came across method definitions like this:

def classPrefix = "class" ~ ID ~ "(" ~ formals ~ ")"

I've been reading throught the API doc of scala.util.parsing.Parsers which defines a method named (tilde) but I still dont't really understand its usage in the example above. In that example (tilde) is a method which is called on java.lang.String which doesn't have that method and causes the compiler to fail. I know that (tilde) is defined as

case class ~ [+a, +b] (_1: a, _2: b)

but how does this help in the example above?

I'd be happy if someone could give me a hint to understand what's going on here. Thank you very much in advance!

Jan

Rex Kerr · Accepted Answer

The structure here is a little bit tricky. First, notice that you always define these things inside a subclass of some parser, e.g. class MyParser extends RegexParsers. Now, you may note two implicit definitions inside RegexParsers:

implicit def literal (s: String): Parser[String]
implicit def regex (r: Regex): Parser[String]

What these will do is take any string or regex and convert them into a parser that matches that string or that regex as a token. They're implicit, so they'll be applied any time they're needed (e.g. if you call a method on Parser[String] that String (or Regex) does not have).

But what is this Parser thing? It's an inner class defined inside Parsers, the supertrait for RegexParser:

class Parser [+T] extends (Input) ⇒ ParseResult[T]

Looks like it's a function that takes input and maps it to a result. Well, that makes sense! And you can see the documentation for it here.

Now we can just look up the ~ method:

def ~ [U] (q: ⇒ Parser[U]): Parser[~[T, U]]
  A parser combinator for sequential composition
  p ~ q' succeeds if p' succeeds and q' succeeds on the input left over by p'.

So, if we see something like

def seaFacts = "fish" ~ "swim"

what happens is, first, "fish" does not have the ~ method, so it's implicitly converted to Parser[String] which does. The ~ method then wants an argument of type Parser[U], and so we implicitly convert "swim" into Parser[String] (i.e. U == String). Now we have something that will match an input "fish", and whatever is left in the input should match "swim", and if both are the case, then seaFacts will succeed in its match.

Didier Dupont · Answer

The ~ method on parser combines two parser in one which applies the two original parsers successively and returns the two results. That could be simply (in Parser[T])

def ~[U](q: =>Parser[U]): Parser[(T,U)].

If you never combined more than two parsers, that would be ok. However, if you chain three of them, p1, p2, p3, with return types T1, T2, T3, then p1 ~ p2 ~ p3, which means p1.~(p2).~(p3) is of type Parser[((T1, T2), T3)]. And if you combine five of them as in your example, that would be Parser[((((T1, T2), T3), T4), T5)]. Then when you pattern match on the result, you would have all those parantheses too :

case ((((_, id), _), formals), _) => ...

This is quite uncomfortable.

Then comes a clever syntactic trick. When a case class has two parameters, it can appears in infix rather than prefix position in a pattern. That is, if you have case class X(a: A, b: B), you can pattern match with case X(a, b), but also with case a X b. (That is what is done with a pattern x::xs to match a non empty List, :: is a case class). When you write case a ~ b ~ c, it means case ~(~(a,b), c), but is much more pleasant, and more pleasant than case ((a,b), c) too, which is tricky to get right.

So the ~ method in Parser returns a Parser[~[T,U]] instead of a Parser[(T,U)], so you can pattern match easily on the result of multiple ~. Beside that, ~[T,U] and (T,U) are pretty much the same thing, as isomorphic as you can get.

The same name is chosen for the combining method in parser and for the result type, because the resulting code is natural to read. One sees immediately how each part in the result processing relates to the items of the grammar rule.

parser1 ~ parser2 ~ parser3 ^^ {case part1 ~ part2 ~ part3 => ...}

Tilda is chosen because its precedence (it binds tightly) plays nicely with the other operators on parser.

One last point, there are auxiliary operators ~> and <~ which discard the result of one of the operand, typically the constant parts in the rule which carries no useful data. So one would rather write

"class" ~> ID <~ ")" ~ formals <~ ")"

and get only the values of ID and formals in the result.

Eugene Yokota · Answer

You should checkout Parsers.Parser. Scala sometimes defines method and case class with the same name to aid pattern matching etc, and it's a little confusing if you're reading the Scaladoc.

In particular, "class" ~ ID is same as "class".~(ID). ~ is a method that combines the parser with another parser sequentially.

There's an implicit conversion defined in RegexParsers that automatically creates a parser from a String value. So, "class" automatically becomes an instance of Parser[String].

val ID = """[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*"""r

RegexParsers also defines another implicit conversion that automatically creates parser from a Regex value. So, ID automatically becomes an instance of Parser[String] too.

By combining two parsers, "class" ~ ID returns a Parser[String] that matches the literal "class" and then the regular expression ID appearing sequentially. There are other methods like | and |||. For more info, read Programming in Scala.

Understanding the tilde in Scala's parser combinators

Tags:

scala

parser-combinators

Jano

3 Answers

Rex Kerr

Didier Dupont

Eugene Yokota

Recent Activity

Donate For Us

Understanding the tilde in Scala's parser combinators

Tags:

scala

parser-combinators

Jano

3 Answers

Rex Kerr

Didier Dupont

Eugene Yokota

Related questions

Recent Activity

Donate For Us