Scala Parser Combinators: Efficiently Parse C-Style Comments

What is the best way to (efficiently) parse C-style multi-line comments (i.e., /* ... */) with Scala parser combinators?

In a project I'm involved in, we parse a C-like programming language and want to support multi-line comments. We use a subclass of StandardTokenParsers, which already handles such comments (via StdLexical). However, the class only works for fairly short multi-line comments and runs out of stack space otherwise.

We have also tried providing our own definition of whitespace to make things more efficient. We used RegexParsers (inspired by another question on StackOverflow) as follows:

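import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.combinator.syntactical.StandardTokenParsers

class Parser extends StandardTokenParsers {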

  override val lexical = new StdLexical {
    def rp: RegexParsers = new RegexParsers {}
    override val whitespace: Parser[Any] = rp.regex("""(\s|//.*|(?m)/\*(\*(?!/)|[^*])*\*/)*""".r).asInstanceOf[Parser[Any]]
  }

  // ...

}

This improved the situation slightly, but still causes a stack overflow if the comment is more than a few dozen lines. Any ideas how to improve this?

Asked by stefan on Oct 06 '12 at 23:10


1 Answer

We have had some success with this sort of issue by defining whitespace skipping using parsers instead of using regular expressions. See the WhitespaceParser trait in our Kiama ParserUtilities.scala for some support code.

Most of the mucking about there is to override the normal regular-expression whitespace handling and to tie the new parser into the literal and regex combinators (we don't typically use the token parsers). See one of our examples for usage, in this case to handle nested comments.
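
To make the idea of skipping whitespace without a regular expression concrete, here is a minimal sketch. It is not Kiama's WhitespaceParser: it swaps in a plain hand-written loop, and the names CommentSkipper and skip are hypothetical. It handles only non-nested /* ... */ and // comments, and it uses neither recursion nor regex backtracking, so the length of a comment should not affect stack depth.

```scala
object CommentSkipper {
  // Hypothetical helper: returns the index of the first character at or
  // after `offset` that is neither whitespace nor inside a line comment
  // or a (non-nested) block comment.
  // Assumption: an unterminated block comment is left unskipped, so the
  // caller can report it as an error at that position.
  def skip(source: CharSequence, offset: Int): Int = {
    val len = source.length
    var i = offset
    var progressed = true
    while (progressed) {
      progressed = false
      // Skip plain whitespace.
      while (i < len && Character.isWhitespace(source.charAt(i))) {
        i += 1
        progressed = true
      }
      if (i + 1 < len && source.charAt(i) == '/') {
        if (source.charAt(i + 1) == '/') {
          // Line comment: runs up to (not including) the next newline.
          i += 2
          while (i < len && source.charAt(i) != '\n') i += 1
          progressed = true
        } else if (source.charAt(i + 1) == '*') {
          // Block comment: scan forward for the closing star-slash pair.
          var j = i + 2
          while (j + 1 < len &&
                 !(source.charAt(j) == '*' && source.charAt(j + 1) == '/')) {
            j += 1
          }
          if (j + 1 < len) {
            i = j + 2 // step past the closing delimiter
            progressed = true
          }
        }
      }
    }
    i
  }
}
```

One possible way to wire this in, assuming a RegexParsers-based parser, is to override its protected handleWhiteSpace(source, offset) method so that it returns CommentSkipper.skip(source, offset). With StdLexical, the whitespace member would instead need a hand-written Parser[Any] that reads the input's source and offset and drops the skipped characters; that part is left as an untested assumption here.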

Answered by inkytonik on Sep 21 '22 at 12:09