Forgive me if I have the incorrect terminology; perhaps just getting the "right" words to describe what I want is enough for me to find the answer on my own.
I am working on a parser for ODL (Object Description Language), an arcane language that as far as I can tell is now used only by NASA PDS (Planetary Data Systems; it's how NASA makes its data available to the public). Fortunately, PDS is finally moving to XML, but I still have to write software for a mission that fell just before the cutoff.
ODL defines objects in something like the following manner:
OBJECT = TABLE
ROWS = 128
ROW_BYTES = 512
END_OBJECT = TABLE
I am attempting to write a parser with pyparsing
, and I was doing fine right up until I came to the above construction.
I have to create some rule that is able to ensure that the right-hand-value of the OBJECT line is identical to the RHV of END_OBJECT. But I can't seem to put that into a pyparsing
rule. I can ensure that both are syntactically valid values, but I can't go the extra step and ensure that the values are identical.
pyparsing
able to handle this kind of construction?pyparsing
is not able to handle it, is there another Python tool capable of doing so? How about ply
(the Python implementation of lex
/yacc
)?Context-sensitive grammar is an important grammar in formal language theory, in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols.
In context sensitive grammar, there is either left context or right context (αAβ i.e. α is left context and β is right) with variables. But in context free grammar (CFG) there will be no context. We cannot replace B until we get B0. Therefore, CSG is harder to understand than the CFG.
Context-free grammars are a subset of context-sensitive grammars. A grammar is a concise way to specify a language.
A formal grammar known as a CSG allows any production rules' left and right sides to be surrounded by a context of the terminal as well as nonterminal symbols.
It is in fact a grammar for a context-sensitive language, classically abstracted as wcw
where w is in (a|b)* (note that wcw'
, where '
indicates reversal, is context-free).
Parsing Expression Grammars are capable of parsing wcw-type languages by using semantic predicates. PyParsing provides the matchPreviousExpr()
and matchPreviousLiteral()
helper methods for this very purpose, e.g.
w = Word("ab")
s = w + "c" + matchPreviousExpr(w)
So in your case you'd probably do something like
table_name = Word(alphas, alphanums)
object = Literal("OBJECT") + "=" + table_name + ... +
Literal("END_OBJECT") + "=" +matchPreviousExpr(table_name)
As a general rule, parsers are built as context-free parsing engines. If there is context sensitivity, it is grafted on after parsing (or at least after the relevant parsing steps are completed).
In your case, you want to write context-free grammar rules:
head = 'OBJECT' '=' IDENTIFIER ;
tail = 'END_OBJECT' '=' IDENTIFIER ;
element = IDENTIFIER '=' value ;
element_list = element ;
element_list = element_list element ;
block = head element_list tail ;
The checks that the head and tail constructs have matching identifiers isn't technically done by the parser.
Many parsers, however, allow a semantic action to occur when a syntactic element is recognized, often for the purpose of building tree nodes. In your case, you want to use this to enable additional checking. For element, you want to make sure the IDENTIFIER isn't a duplicate of something already in the block; this means for each element encountered, you'll want to capture the corresponding IDENTIFIER and make a block-specific list to enable duplicate checking. For block, you want to capture the head *IDENTIFIER*, and check that it matches the tail *IDENTIFIER*.
This is easiest if you build a tree representing the parse as you go along, and hang the various context-sensitive values on the tree in various places (e.g., attach the actual IDENTIFIER value to the tree node for the head clause). At the point where you are building the tree node for the tail construct, it should be straightforward to walk up the tree, find the head tree, and then compare the identifiers.
This is easier to think about if you imagine the entire tree being built first, and then a post-processing pass over the tree is used to this checking. Lazy people in fact do it this way :-} All we are doing is pushing work that could be done in the post processing step, into the tree-building steps attached to the semantic actions.
None of these concepts is python specific, and the details for PyParsing will vary somewhat.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With