Does Pyparsing Support Context-Sensitive Grammars?

Tags: python, parsing, pyparsing, ply

Forgive me if I have the incorrect terminology; perhaps just getting the "right" words to describe what I want is enough for me to find the answer on my own.

I am working on a parser for ODL (Object Description Language), an arcane language that as far as I can tell is now used only by NASA PDS (Planetary Data System; it's how NASA makes its data available to the public). Fortunately, PDS is finally moving to XML, but I still have to write software for a mission that fell just before the cutoff.

ODL defines objects in something like the following manner:

OBJECT              = TABLE
  ROWS              = 128
  ROW_BYTES         = 512 
END_OBJECT          = TABLE

I am attempting to write a parser with pyparsing, and I was doing fine right up until I came to the above construction.

I have to create some rule that is able to ensure that the right-hand-value of the OBJECT line is identical to the RHV of END_OBJECT. But I can't seem to put that into a pyparsing rule. I can ensure that both are syntactically valid values, but I can't go the extra step and ensure that the values are identical.

1. Am I correct in my intuition that this is a context-sensitive grammar? Is that the phrase I should be using to describe this problem?
2. Whatever kind of grammar this is in the theoretical sense, is pyparsing able to handle this kind of construction?
3. If pyparsing is not able to handle it, is there another Python tool capable of doing so? How about ply (the Python implementation of lex/yacc)?

asked Feb 27 '13 02:02

HardlyKnowEm

2 Answers

It is in fact a grammar for a context-sensitive language, classically abstracted as wcw, where w is any string in (a|b)* (note that wcw', where ' indicates reversal, is context-free).

Parsing Expression Grammars can parse wcw-type languages by using semantic predicates. pyparsing provides the matchPreviousExpr() and matchPreviousLiteral() helper functions for this very purpose (pyparsing 3 also spells them match_previous_expr() and match_previous_literal()), e.g.

from pyparsing import Word, matchPreviousExpr

w = Word("ab")
s = w + "c" + matchPreviousExpr(w)

So in your case you'd probably do something like

table_name = Word(alphas, alphanums)
# "..." is a placeholder for the rules that match the body of the block
odl_object = (Literal("OBJECT") + "=" + table_name + ... +
              Literal("END_OBJECT") + "=" + matchPreviousExpr(table_name))
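To make that outline concrete, here is a minimal sketch of a flat OBJECT ... END_OBJECT block, assuming values are only integers or bare words. The names identifier, value, element, block and parse_block are made up for illustration and are not part of ODL or pyparsing. One detail worth noting: matchPreviousExpr() attaches a parse action to the expression it is given, so the sketch uses a separate copy of the identifier rule for the head name and creates the back-reference before that copy is given a results name.

```python
from pyparsing import (Group, Keyword, ParseException, Suppress, Word,
                       ZeroOrMore, alphanums, alphas, matchPreviousExpr, nums)

EQ = Suppress("=")
identifier = Word(alphas, alphanums + "_")
value = Word(nums) | identifier      # assumption: integers or bare words only

# Separate copy for the head name, so the back-reference tracks only this
# expression and not every identifier in the block body.
object_name = identifier.copy()
# Create the back-reference before object_name is copied by a results name.
same_name = matchPreviousExpr(object_name)

# A "KEY = value" line that is not the closing END_OBJECT line.
element = Group(~Keyword("END_OBJECT") + identifier + EQ + value)

block = (
    Keyword("OBJECT") + EQ + object_name("name")
    + Group(ZeroOrMore(element))("elements")
    + Keyword("END_OBJECT") + EQ + same_name
)


def parse_block(text):
    """Return (name, {key: value}); raise ParseException if the names differ."""
    result = block.parseString(text, parseAll=True)
    return result["name"], {key: val for key, val in result["elements"]}
```

Calling parse_block() on the TABLE example from the question should return the name and the two elements, while changing the last line to END_OBJECT = IMAGE should raise ParseException.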

answered Oct 01 '22 12:10

ebohlman

As a general rule, parsers are built as context-free parsing engines. If there is context sensitivity, it is grafted on after parsing (or at least after the relevant parsing steps are completed).

In your case, you want to write context-free grammar rules:

  head = 'OBJECT' '=' IDENTIFIER ;
  tail = 'END_OBJECT'  '=' IDENTIFIER ;
  element = IDENTIFIER '=' value ;
  element_list = element ;
  element_list = element_list element ;
  block = head element_list tail ;

The check that the head and tail constructs have matching identifiers isn't technically done by the parser.

Many parsers, however, allow a semantic action to occur when a syntactic element is recognized, often for the purpose of building tree nodes. In your case, you want to use this to enable additional checking. For element, you want to make sure the IDENTIFIER isn't a duplicate of something already in the block; this means for each element encountered, you'll want to capture the corresponding IDENTIFIER and make a block-specific list to enable duplicate checking. For block, you want to capture the head IDENTIFIER, and check that it matches the tail IDENTIFIER.
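As a rough illustration of checks running as each construct is recognized, here is a small hand-written sketch in plain Python (no parser library). It assumes a flat block whose lines are all simple KEY = value pairs; parse_block, OdlSyntaxError and OdlSemanticError are hypothetical names invented for this example.

```python
import re


class OdlSyntaxError(ValueError):
    """The text does not have the shape: head element_list tail."""


class OdlSemanticError(ValueError):
    """The text is syntactically valid but fails a context check."""


# One "KEY = value" line; both sides are simple words in this sketch.
_LINE = re.compile(r"^\s*(\w+)\s*=\s*(\w+)\s*$")


def parse_block(text):
    """Parse one flat block and run the semantic checks along the way."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            match = _LINE.match(line)
            if match is None:
                raise OdlSyntaxError(f"expected KEY = value, got {line!r}")
            pairs.append(match.groups())

    if not pairs or pairs[0][0] != "OBJECT":
        raise OdlSyntaxError("block must start with OBJECT = <name>")
    head_name = pairs[0][1]   # action for `head`: remember the identifier
    elements = {}             # block-specific table for duplicate checking

    for position, (key, value) in enumerate(pairs[1:], start=1):
        if key == "END_OBJECT":
            # Action for `tail`: compare against the remembered head name.
            if value != head_name:
                raise OdlSemanticError(
                    f"END_OBJECT = {value} does not close OBJECT = {head_name}")
            if position != len(pairs) - 1:
                raise OdlSyntaxError("unexpected text after END_OBJECT")
            if not elements:
                raise OdlSyntaxError("block has no elements")
            return head_name, elements
        # Action for `element`: reject identifiers already seen in this block.
        if key in elements:
            raise OdlSemanticError(f"duplicate element {key!r}")
        elements[key] = value

    raise OdlSyntaxError("missing END_OBJECT")
```

Keeping the two exception types separate mirrors the split described above: the shape of the block is a syntax question, while the name match and the duplicate check are semantic ones.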

This is easiest if you build a tree representing the parse as you go along, and hang the various context-sensitive values on the tree in various places (e.g., attach the actual IDENTIFIER value to the tree node for the head clause). At the point where you are building the tree node for the tail construct, it should be straightforward to walk up the tree, find the head tree, and then compare the identifiers.

This is easier to think about if you imagine the entire tree being built first, and then a post-processing pass over the tree is used to do this checking. Lazy people in fact do it this way :-} All we are doing is pushing work that could be done in the post-processing step into the tree-building steps attached to the semantic actions.
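The "build the tree first, check afterwards" variant might look something like the sketch below. The Element and Block classes and the check_tree function are hypothetical, and nested blocks are an assumption that goes beyond the grammar above; the tree would be produced by whatever context-free parser you use.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str    # IDENTIFIER on the left of '='
    value: str


@dataclass
class Block:
    head_name: str   # IDENTIFIER captured from the head clause
    tail_name: str   # IDENTIFIER captured from the tail clause
    children: list = field(default_factory=list)   # Element or nested Block


def check_tree(block, path=""):
    """Return a list of context-sensitive errors in `block` and below it."""
    here = f"{path}/{block.head_name}"
    errors = []
    if block.head_name != block.tail_name:
        errors.append(f"{here}: closed by END_OBJECT = {block.tail_name}")
    seen = set()   # element names already used in this block
    for child in block.children:
        if isinstance(child, Block):
            errors.extend(check_tree(child, here))
        elif child.name in seen:
            errors.append(f"{here}: duplicate element {child.name}")
        else:
            seen.add(child.name)
    return errors


# Usage: a hand-built tree standing in for real parser output.
tree = Block("TABLE", "TABLE", [Element("ROWS", "128"),
                                Element("ROW_BYTES", "512")])
problems = check_tree(tree)   # an empty list means every check passed
```

Collecting errors in a list rather than raising on the first one is a design choice: a post-processing pass sees the whole tree, so it can report every mismatch in one run.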

None of these concepts is Python-specific, and the details for pyparsing will vary somewhat.

answered Oct 01 '22 13:10

Ira Baxter
