I'm trying to get a deeper understanding of how Python works, and I've been looking at the grammar shown at http://docs.python.org/3.3/reference/grammar.html.
I notice it says you would have to change parsermodule.c also, but truthfully I'm just not following what's going on here.
I understand that a grammar is a specification for how to read the language, but...I can't even tell what this is written in. It looks almost like Python but then it isn't.
I'm looking to get a better understanding of this specification and how it is used internally by Python to....do things. What depends on it (the answer is everything, but I mean specifically which aspect of the "engine" is processing it), what uses it, how does it tie in to compiling/running a script?
It's hard to believe that the whole language comes down to a two page specification...
A grammar is used to describe all possible strings in a language. It is also useful in specifying how a parser should parse the language.
In this grammar it seems like they are using their own version of EBNF, where a non-terminal is any lowercase word and a terminal is all uppercase or surrounded by quotes. For example, NEWLINE is a terminal, arith_expr is a non-terminal and 'if' is also a terminal. Any non-terminal can be replaced by anything to the right of the colon of it's respective production rule. For example, if you look at the first rule:
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
We can replace single_input with one of either a NEWLINE, a simple_stmt or a compound_stmt followed by a NEWLINE. Suppose we replaced it with "compound_stmt NEWLINE", then we would look for the production rule for compound_stmt:
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated
and choose which of these we want to use, and substitute it for "compound_stmt" (Keeping NEWLINE in it's place)
Suppose we wanted to generate the valid python program:
if 5 < 2 + 3 or not 1 == 5:
raise
We could use the following derivation:
A couple of notes here, firstly, we must start with one of the non-terminals which is listed as a starting non-terminal. In that page, they list them as single_input, file_input, or eval_input. Secondly, a derivation is finished once all the symbols are terminal (hence the name). Thirdly, it is more common to do one substitution per line, for the sake of brevity I did all possible substitutions at once and started skipping steps near the end.
Given a string in the language, how do we find it's derivation? This is the job of a parser. A parser reverse-engineers a production sequence to first check that it is indeed a valid string, and furthermore how it can be derived from the grammar. It's worth noting that many grammars can describe a single language. However, for a given string, it's derivation will of course be different for each grammar. So technically we write a parser for a grammar not a language. Some grammars are easier to parse, some grammars are easier to read/understand. This one belongs in the former.
Also this doesn't specify the entire language, just what it looks like. A grammar says nothing about semantics.
If you're interested in more about parsing and grammar I recommend Grune, Jacobs - Parsing Techniques. It's free and good for self-study.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With