How can parsers be used to parse records that spans multiple lines? I need to parse tree data (and eventually transform it to a tree data structure). I'm getting a difficult-to-trace parse error in the code below, but its not clear if this is even the best approach with Scala parsers. The question is really more about the problem solving approach rather than debugging existing code.
The EBNF-ish grammer is:
SP = " "
CRLF = "\r\n"
level = "0" | "1" | "2" | "3"
varName = {alphanum}
varValue = {alphnum}
recordBegin = "0", varName
recordItem = level, varName, [varValue]
record = recordBegin, {recordItem}
file = {record}
An attempt to implement and test the grammer:
import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""
object TreeParser extends JavaTokenParsers {
override val skipWhitespace = false
def CRLF = "\r\n" | "\n"
def BOF = "\\A".r
def EOF = "\\Z".r
def TXT = "[^\r\n]*".r
def TXTNOSP = "[^ \r\n]*".r
def SP = "\\s".r
def level: Parser[Int] = "[0-3]{1}".r ^^ {v => v.toInt}
def varName: Parser[String] = SP ~> TXTNOSP
def varValue: Parser[String] = SP ~> TXT
def recordBegin: Parser[Any] = "0" ~ SP ~ varName ~ CRLF
def recordItem: Parser[(Int,String,String)] = level ~ varValue ~ opt(varValue) <~ CRLF ^^
{case l ~ f ~ v => (l,f,v.map(_+"").getOrElse(""))}
def record: Parser[List[(Int,String,String)]] = recordBegin ~> rep(recordItem)
def file: Parser[List[List[(Int,String,String)]]] = rep(record) <~ EOF
def parse(input: String) = parseAll(file, input)
}
val result = TreeParser.parse(input).get
result.foreach(println)
As Daniel said, you should better let the parser handle whitespace skipping to minimize your code. However you may want to tweak the whitespace value so you can match end of lines explicitly. I did it below to prevent the parser from moving to the next line if no value for a record is defined.
As much as possible, try to use the parsers defined in JavaTokenParsers like ident if you want to match alphabetic words.
To ease your error tracing, perform a NoSuccess match on parseAll so you can see at what point the parser failed.
import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 var_without_value
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""
object TreeParser extends JavaTokenParsers {
override val whiteSpace = """[ \t]+""".r
val level = """[1-3]{1}""".r
val value = """[a-zA-Z0-9_, ]*""".r
val eol = """[\r?\n]+""".r
def recordBegin = "0" ~ ident <~ eol
def recordItem = level ~ ident ~ opt(value) <~ opt(eol) ^^ {
case l ~ n ~ v => (l.toInt, n, v.getOrElse(""))
}
def record = recordBegin ~> rep1(recordItem)
def file = rep1(record)
def parse(input: String) = parseAll(file, input) match {
case Success(result, _) => result
case NoSuccess(msg, _) => throw new RuntimeException("Parsing Failed:" + msg)
}
}
val result = TreeParser.parse(input)
result.foreach(println)
Handling whitespace explicitly is not a particularly good idea. And, of course, using get means you lose the error message. In this particular example:
[1.3] failure: string matching regex `\s' expected but `f' found
0 fruit
^
Which is actually pretty clear, though the question is why it expected a space. Now, this was obviously processing a recordBegin rule, which is defined thusly:
"0" ~ SP ~ varName ~ CRLF
So, it parsers the zero, then the space, and then fruit must be parsed against varName. Now, varName is defined like this:
SP ~> TXTNOSP
Another space! So, fruit should have began with a space.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With