What is the best way to (efficiently) parse C-style multi-line comments (i.e., /* ... */) with Scala parser combinators?
In a project that I'm involved in, we parse a C-like programming language and want to support multi-line comments. We use a subclass of StandardTokenParsers, which already handles such comments (via StdLexical). However, the class only works for fairly short multi-line comments, and runs out of stack space otherwise.
We have also tried providing our own definition of whitespace to make things more efficient. We used a RegexParsers instance (inspired by another question on StackOverflow) as follows:
    class Parser extends StandardTokenParsers {
      override val lexical = new StdLexical {
        def rp: RegexParsers = new RegexParsers {}
        override val whitespace: Parser[Any] = rp.regex("""(\s|//.*|(?m)/\*(\*(?!/)|[^*])*\*/)*""".r).asInstanceOf[Parser[Any]]
      }
      // ...
    }
This improved the situation slightly, but still causes a stack overflow if the comment is more than a few dozen lines. Any ideas how to improve this?
We have had some success with this sort of issue by defining whitespace skipping using parsers instead of using regular expressions. See the WhitespaceParser trait in our Kiama ParserUtilities.scala for some support code.
Most of the mucking about is to override the normal regular expression whitespace handling and to tie the new parser into the literal and regex combinators (we don't typically use the token parsers). See one of our examples for usage, in this case to handle nested comments.
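For readers who want to stay with StandardTokenParsers, the same idea (skip whitespace with a parser rather than a regular expression) can be sketched as below. This is not the Kiama WhitespaceParser code, only a minimal illustration under stated assumptions: the names LoopingLexical, startsWith, skipLine, skipBlock and skipAll are hypothetical, comments do not nest, and isWhitespace is used in place of StdLexical's own whitespace test. Every helper is tail-recursive, so stack use should not grow with the length of a comment.

```scala
import scala.annotation.tailrec
import scala.util.parsing.combinator.lexical.StdLexical

// Hypothetical lexer: skips whitespace and comments with loops
// (tail calls) instead of nested parser or regex recursion.
class LoopingLexical extends StdLexical {

  // True if the next two characters of `in` are `a` followed by `b`.
  private def startsWith(in: Input, a: Char, b: Char): Boolean =
    !in.atEnd && in.first == a && !in.rest.atEnd && in.rest.first == b

  // Drop characters up to (but not including) the end of the line.
  @tailrec private def skipLine(in: Input): Input =
    if (in.atEnd || in.first == '\n') in else skipLine(in.rest)

  // Drop characters up to and including the closing "*/".
  // Returns None if the input ends before the comment is closed.
  @tailrec private def skipBlock(in: Input): Option[Input] =
    if (in.atEnd) None
    else if (startsWith(in, '*', '/')) Some(in.rest.rest)
    else skipBlock(in.rest)

  // Repeatedly skip whitespace, "//" comments and "/* */" comments.
  @tailrec private def skipAll(in: Input): ParseResult[Any] = {
    if (!in.atEnd && in.first.isWhitespace) skipAll(in.rest)
    else if (startsWith(in, '/', '/')) skipAll(skipLine(in.rest.rest))
    else if (startsWith(in, '/', '*')) {
      skipBlock(in.rest.rest) match {
        case Some(after) => skipAll(after)
        case None        => Error("unclosed comment", in)
      }
    }
    else Success((), in)
  }

  // Replace StdLexical's combinator-based whitespace definition.
  override def whitespace: Parser[Any] = Parser[Any] { in => skipAll(in) }
}
```

A StandardTokenParsers subclass could then use it with `override val lexical = new LoopingLexical`, in the same place where the question overrides `lexical`.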