To preface this, my knowledge of this kind of stuff is puny.
Anyways, I've been developing a context-free grammar to describe the structure of alegbraic expressions so I can teach myself how the CYK parsing algorithm works. I understand how such a structure can work with only infix algebraic expressions, but I cannot understand how to develop a grammar that can handle both the unary and binary definitions of the "-" operator.
For reference, here's the grammar I've written (where S is the start symbol) in CNF:
S -> x
A -> O S
S -> L B
B -> S R
S -> K S
O -> +
O -> -
O -> *
O -> /
O -> ^
K -> -
L -> (
R -> )
The problem is that how can the CYK parsing algorithm know ahead of time whether to decide between S -> K S and A -> O S when it encounters the "-" operator? Is such a grammar context-free anymore? And most importantly, since programming languages can handle languages with both the binary and unary minus sign, how should I reasonably parse this?
In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree, while an unambiguous grammar is a context-free grammar for which every valid string has a unique leftmost derivation or parse tree.
A grammar is said to be ambiguous if there exists more than one leftmost derivation or more than one rightmost derivation or more than one parse tree for the given input string. If the grammar is not ambiguous, then it is called unambiguous. If the grammar has ambiguity, then it is not good for compiler construction.
A grammar is said to be ambiguous if there exists more than one left most derivation or more than one right most derivation or more than one parse tree for a given input string. If the grammar is not ambiguous then we call it unambiguous grammar. If the grammar has ambiguity then it is good for compiler construction.
This seems like a problem related to finite state automata and I don't remember everything from my coursework, but I wrote a CYK parser in OCaml, so I'll go ahead and take a shot :)
If you're trying to parse an expression like 3- -4
for example, you would have your S -> K S
rule consume the -4
and then your A -> O S
rule would absorb the - -4
. This would eventually work up to the top-most S
production rule. You should be careful with the grammar you're using though, since the A
production rule you listed cannot be reached from S
and you should probably have a S -> S O S
rule of some sort.
The way that CYK parsing algorithms work is through backtracking, not through the "knowing ahead of time" that you mentioned in your question. What your CYK algorithm should do is to parse the -4
as a S -> K S
rule and then it would try to absorb the second -
with the S -> K S
rule again because this production rule allows for an arbitrarily long chain of unary -
. But once your algorithm realizes that it's stuck with the intermediate parse 3 S
, it realizes that it has no production symbols that it can use to parse this. Once it realizes that this is no longer parseable, it will go back and instead try to parse the -
as an S -> O S
rule instead and continue on its merry way.
This means that your grammar remains context-free since a context-sensitive grammar means that you have terminals on the left side of the production rules, so you're good in that respect. HTH!
The grammar is ambiguous, and the parser cannot decide which case to take.
You should probably use a grammar like the following:
S -> EXPR
EXPR -> (EXPR)
EXPR -> - EXPR
EXPR -> EXPR + EXPR
EXPR -> EXPR - EXPR
// etc...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With