Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing a "simple" grammar

Sorry in advance; I am sure this question will seem almost idiotic to people who are used to playing with parsers and grammars, but these are foreign topics to me, and this is my attempt at stepping gently into a practical case requiring them.

I would like to write a parser for the following "language", which contains a single "special structure" looking like this:

\command[ options ]{ contents }

The contents can be anything, including nested commands, and may contain escaped brackets or backslashes \{ \} \\. I realise that "anything" is not specific, but ideally they should be determined by matching brackets (excluding escaped ones), if possible.

The options should be a comma-separated list of assignment expressions such as name = value, but the value may be a quoted string containing = or , characters. Finally the previous name and command should validate the regular expression \w[\w\d\._-+*]* -- that is, the first character should be a letter, and the remaining characters should be a letter, digit or one of . _ - + *.

Writing this with regular expressions seems overly complicated (for instance, because values may contain quoted characters , =, which would otherwise separate assignments or name/value pairs). So I think the most appropriate tool here is a grammar, but despite superficial readings, I am just not sure how to write it (BNF, PEG, etc?), which type of parsers to use (LR, recursive decent, etc?), and how I could use the parsing output in a practical program.

I would prefer answers with Python, which explains the tag, but of course I would be perfectly happy with a combination of tools if necessary/better suited.


NOTE: this is NOT about LaTeX. I realise the similarity of course, but LaTeX is extremely more complex than the previous language, for instance with character codes varying depending on the context. I am merely asking for a practical example which (I think) is simple enough for SO, and yet would already be useful to me in my daily work.

like image 934
Jonathan H Avatar asked Apr 11 '17 15:04

Jonathan H


People also ask

What is parsing in grammar?

Parsing is a grammatical exercise that involves breaking down a text into its component parts of speech with an explanation of the form, function, and syntactic relationship of each part so that the text can be understood. The term "parsing" comes from the Latin pars for "part (of speech)."

What is parsing in simple terms?

To parse is to break up a sentence or group of words into separate components, including the definition of each part's function or form. The technical definition implies the same concept. Parsing is used in all high-level programming languages.

What is an example of parsing?

Parse is defined as to break something down into its parts, particularly for study of the individual parts. An example of to parse is to break down a sentence to explain each element to someone. To examine closely or subject to detailed analysis, especially by breaking up into components.

How can you identify parsing in a sentence?

Parsing is identifying the function of words in sentences. Each word must be looked at in context to decide which part of speech it is. For example, the sentence 'Six fish swam quickly' can be parsed as adjective, common noun, verb, adverb.


1 Answers

Start by expressing your grammar more formally, in whatever notation you prefer. e.g., from your description, an EBNF would be like this:

program := element+
element := command | literal
literal := (not '\')+

command := '\'identifier options? '{' program '}'
options := option | options ',' option
option  := identifier '=' value
value   := number | string

string  := '"' (escape | not '\' or '"')* '"'
escape  : = '\' char

Then either feed this to a parser generator (pyParsing, pyYACC, ANTLR) or write a parser by hand. In the latter case top-down is the simplest option: start from the top of the grammar and convert each rule to a function which will either return a parsed AST node and consume the input or return nothing or throw. Example:

 def program():
    elements = []
    while next_sym():
        elements.append(element())
    return {'type': 'program', 'children': elements}

 def element():
     return command() or literal()

 def command():
     if next_sym() == '\\':
         get_sym()
         ...parse command here
         return {'type': 'command', 'children': ...}
     return None 

where next_sym returns the next symbol from the input (or None on EOF) and get_sym consumes the symbol and advances the input buffer.

like image 139
georg Avatar answered Oct 17 '22 16:10

georg