Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scala, Parser Combinator for Tree Structured Data

How can parsers be used to parse records that spans multiple lines? I need to parse tree data (and eventually transform it to a tree data structure). I'm getting a difficult-to-trace parse error in the code below, but its not clear if this is even the best approach with Scala parsers. The question is really more about the problem solving approach rather than debugging existing code.

The EBNF-ish grammer is:

SP          = " "
CRLF        = "\r\n"
level       = "0" | "1" | "2" | "3"
varName     = {alphanum}
varValue    = {alphnum}
recordBegin = "0", varName
recordItem  = level, varName, [varValue]
record      = recordBegin, {recordItem}
file        = {record}

An attempt to implement and test the grammer:

import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""

object TreeParser extends JavaTokenParsers {
  override val skipWhitespace = false
  def CRLF = "\r\n" | "\n"
  def BOF = "\\A".r
  def EOF = "\\Z".r
  def TXT = "[^\r\n]*".r
  def TXTNOSP = "[^ \r\n]*".r
  def SP = "\\s".r
  def level: Parser[Int] = "[0-3]{1}".r ^^ {v => v.toInt}
  def varName: Parser[String] = SP ~> TXTNOSP
  def varValue: Parser[String] = SP ~> TXT
  def recordBegin: Parser[Any] =  "0" ~ SP ~ varName ~ CRLF
  def recordItem: Parser[(Int,String,String)] = level ~ varValue ~ opt(varValue) <~ CRLF ^^
    {case l ~ f ~ v => (l,f,v.map(_+"").getOrElse(""))}
  def record: Parser[List[(Int,String,String)]] = recordBegin ~> rep(recordItem)
  def file: Parser[List[List[(Int,String,String)]]] = rep(record) <~ EOF
  def parse(input: String) = parseAll(file, input)
}

val result = TreeParser.parse(input).get
result.foreach(println)
like image 931
eptx Avatar asked May 30 '11 20:05

eptx


2 Answers

As Daniel said, you should better let the parser handle whitespace skipping to minimize your code. However you may want to tweak the whitespace value so you can match end of lines explicitly. I did it below to prevent the parser from moving to the next line if no value for a record is defined.

As much as possible, try to use the parsers defined in JavaTokenParsers like ident if you want to match alphabetic words.

To ease your error tracing, perform a NoSuccess match on parseAll so you can see at what point the parser failed.

import util.parsing.combinator._

val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 var_without_value
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""

object TreeParser extends JavaTokenParsers {
  override val whiteSpace = """[ \t]+""".r

  val level = """[1-3]{1}""".r

  val value = """[a-zA-Z0-9_, ]*""".r
  val eol = """[\r?\n]+""".r

  def recordBegin = "0" ~ ident <~ eol

  def recordItem = level ~ ident ~ opt(value) <~ opt(eol) ^^ {
    case l ~ n ~ v => (l.toInt, n, v.getOrElse(""))
  }

  def record = recordBegin ~> rep1(recordItem)

  def file = rep1(record)

  def parse(input: String) = parseAll(file, input) match {
    case Success(result, _) => result
    case NoSuccess(msg, _) => throw new RuntimeException("Parsing Failed:" + msg)
  }
}

val result = TreeParser.parse(input)
result.foreach(println)
like image 55
Mark Jayxcela Avatar answered Oct 18 '22 00:10

Mark Jayxcela


Handling whitespace explicitly is not a particularly good idea. And, of course, using get means you lose the error message. In this particular example:

[1.3] failure: string matching regex `\s' expected but `f' found

0 fruit

  ^

Which is actually pretty clear, though the question is why it expected a space. Now, this was obviously processing a recordBegin rule, which is defined thusly:

"0" ~ SP ~ varName ~ CRLF

So, it parsers the zero, then the space, and then fruit must be parsed against varName. Now, varName is defined like this:

SP ~> TXTNOSP

Another space! So, fruit should have began with a space.

like image 3
Daniel C. Sobral Avatar answered Oct 18 '22 00:10

Daniel C. Sobral