How can parsers be used to parse records that span multiple lines? I need to parse tree data (and eventually transform it to a tree data structure). I'm getting a difficult-to-trace parse error in the code below, but it's not clear if this is even the best approach with Scala parsers. The question is really more about the problem-solving approach than about debugging existing code.
The EBNF-ish grammar is:
SP = " "
CRLF = "\r\n"
level = "0" | "1" | "2" | "3"
varName = {alphanum}
varValue = {alphanum}
recordBegin = "0", varName
recordItem = level, varName, [varValue]
record = recordBegin, {recordItem}
file = {record}
An attempt to implement and test the grammar:
import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""
object TreeParser extends JavaTokenParsers {
override val skipWhitespace = false
def CRLF = "\r\n" | "\n"
def BOF = "\\A".r
def EOF = "\\Z".r
def TXT = "[^\r\n]*".r
def TXTNOSP = "[^ \r\n]*".r
def SP = "\\s".r
def level: Parser[Int] = "[0-3]{1}".r ^^ {v => v.toInt}
def varName: Parser[String] = SP ~> TXTNOSP
def varValue: Parser[String] = SP ~> TXT
def recordBegin: Parser[Any] = "0" ~ SP ~ varName ~ CRLF
def recordItem: Parser[(Int,String,String)] = level ~ varValue ~ opt(varValue) <~ CRLF ^^
{case l ~ f ~ v => (l,f,v.map(_+"").getOrElse(""))}
def record: Parser[List[(Int,String,String)]] = recordBegin ~> rep(recordItem)
def file: Parser[List[List[(Int,String,String)]]] = rep(record) <~ EOF
def parse(input: String) = parseAll(file, input)
}
val result = TreeParser.parse(input).get
result.foreach(println)
As Daniel said, you are better off letting the parser handle whitespace skipping to minimize your code. However, you may want to tweak the whiteSpace value so you can match ends of lines explicitly. I did it below to prevent the parser from moving to the next line if no value for a record is defined.
As much as possible, try to use the parsers defined in JavaTokenParsers, like ident, if you want to match words made of letters, digits, and underscores.
To ease your error tracing, perform a NoSuccess match on the result of parseAll so you can see at what point the parser failed.
import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 var_without_value
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""
object TreeParser extends JavaTokenParsers {
override val whiteSpace = """[ \t]+""".r
val level = """[1-3]{1}""".r
val value = """[a-zA-Z0-9_, ]*""".r
val eol = """(\r?\n)+""".r // one or more line breaks; the original character class [\r?\n] also matched a literal '?'
def recordBegin = "0" ~ ident <~ eol
def recordItem = level ~ ident ~ opt(value) <~ opt(eol) ^^ {
case l ~ n ~ v => (l.toInt, n, v.getOrElse(""))
}
def record = recordBegin ~> rep1(recordItem)
def file = rep1(record)
def parse(input: String) = parseAll(file, input) match {
case Success(result, _) => result
case NoSuccess(msg, _) => throw new RuntimeException("Parsing Failed:" + msg)
}
}
val result = TreeParser.parse(input)
result.foreach(println)
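The question mentions eventually turning the parsed records into a tree. One possible way to do that, as a minimal sketch working on the flat (level, name, value) tuples of a single record: Node and TreeBuilder are hypothetical names introduced here, not part of the answer's code, and the sketch assumes an item owns every following item with a greater level.

```scala
// Hypothetical node type for illustration; not part of the original answer.
final case class Node(name: String, value: String, children: List[Node])

object TreeBuilder {
  // Turns a flat, level-annotated list into a forest. An item at level d
  // owns the run of items that immediately follow it with a level > d.
  // Plain recursion: fine for small records, not stack-safe for huge ones.
  def build(items: List[(Int, String, String)]): List[Node] = items match {
    case Nil => Nil
    case (depth, name, value) :: rest =>
      // Split the remainder into this item's descendants and its later siblings.
      val (descendants, siblings) = rest.span { case (d, _, _) => d > depth }
      Node(name, value, build(descendants)) :: build(siblings)
  }
}
```

With the parser above, something like result.map(TreeBuilder.build) would give one forest per record. Note that recordBegin ~> discards the record name, so it would have to be kept in the parse result if the tree needs a "fruit" root.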
Handling whitespace explicitly is not a particularly good idea. And, of course, using get means you lose the error message. In this particular example:
[1.3] failure: string matching regex `\s' expected but `f' found
0 fruit
  ^
This is actually pretty clear, though the question is why it expected a space. Now, this was obviously processing a recordBegin rule, which is defined thusly:
"0" ~ SP ~ varName ~ CRLF
So, it parses the zero, then the space, and then fruit must be parsed against varName. Now, varName is defined like this:
SP ~> TXTNOSP
Another space! So, fruit should have begun with a second space, which it does not.
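To make the diagnosis concrete, here is a minimal sketch of the question's grammar with the duplicated separator removed, keeping the explicit whitespace handling for comparison. FixedTreeParser is a hypothetical name. The sketch assumes the scala-parser-combinators module is on the classpath (it has been a separate library since Scala 2.11). It also restricts level to 1-3, as the other answer does, since otherwise a "0 fruit" line would be swallowed as an ordinary item, and it uses varName rather than varValue for the item name.

```scala
import scala.util.parsing.combinator._

// Sketch only: the question's grammar with the extra SP removed.
object FixedTreeParser extends RegexParsers {
  override val skipWhitespace = false // whitespace is matched by hand below

  def CRLF: Parser[String] = "\r\n" | "\n"
  def SP: Parser[String] = " "
  def TXT: Parser[String] = "[^\r\n]+".r
  def TXTNOSP: Parser[String] = "[^ \r\n]+".r

  def level: Parser[Int] = "[1-3]".r ^^ (_.toInt)

  // varName and varValue each consume their own leading space,
  // so the rules that use them must not add another SP in front.
  def varName: Parser[String] = SP ~> TXTNOSP
  def varValue: Parser[String] = SP ~> TXT

  def recordBegin: Parser[String] = "0" ~> varName <~ CRLF
  def recordItem: Parser[(Int, String, String)] =
    level ~ varName ~ opt(varValue) <~ opt(CRLF) ^^ {
      case l ~ n ~ v => (l, n, v.getOrElse(""))
    }
  def record: Parser[List[(Int, String, String)]] = recordBegin ~> rep(recordItem)
  def file: Parser[List[List[(Int, String, String)]]] = rep(record)

  // Returns the ParseResult so the caller can inspect a failure
  // instead of losing the message by calling get.
  def parse(input: String): ParseResult[List[List[(Int, String, String)]]] =
    parseAll(file, input)
}
```

Matching on the returned ParseResult (Success or NoSuccess) rather than calling get keeps the failure message and position available.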