I am consulting the official Python grammar specification as of Python 3.6. I am unable to find any syntax for comments (they appear prefixed with a #) or for docstrings (they should appear wrapped in '''). A quick look at the lexical analysis page didn't help either - docstrings are defined there as longstrings, but they do not appear in the grammar specification. A type named STRING appears further on, but its definition is nowhere to be found.
Given this, I am curious about how the CPython compiler knows what comments and docstrings are. How is this feat accomplished?
I originally guessed that comments and docstrings are removed in a first pass by the CPython compiler, but then that begs the question of how help() is able to render the relevant docstrings.
A docstring is not a separate grammar entity. It is just a regular simple_stmt (following that rule all the way down to atom and STRING+)*. If it is the first statement in a function body, class body or module, then it is used as the docstring by the compiler.
This is documented in the reference documentation as footnotes to the class and def compound statements:
[3] A string literal appearing as the first statement in the function body is transformed into the function’s __doc__ attribute and therefore the function’s docstring.

[4] A string literal appearing as the first statement in the class body is transformed into the namespace’s __doc__ item and therefore the class’s docstring.
There is currently no reference documentation that specifies the same for modules; I regard this as a documentation bug.
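A quick interactive check shows the rule in action (the names greet and Greeter are made up for this example; the first string statement simply becomes the __doc__ attribute of the resulting object):

>>> def greet():
...     "Say hello."
...
>>> greet.__doc__
'Say hello.'
>>> class Greeter:
...     "A class that greets."
...
>>> Greeter.__doc__
'A class that greets.'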
Comments are removed by the tokenizer and never need to be parsed as grammar. Their whole point is to not have meaning on a grammar level. See the Comments section of the Lexical Analysis documentation:
A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules are invoked. Comments are ignored by the syntax; they are not tokens.
Emphasis on the last sentence is mine. So the tokenizer skips comments altogether:
/* Skip comment */
if (c == '#') {
    while (c != EOF && c != '\n') {
        c = tok_nextc(tok);
    }
}
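As a quick sanity check, parsing a line that contains a comment leaves no trace of it in the resulting tree (the output shown here is from Python 3.6; newer versions display Constant nodes instead of Num):

>>> import ast
>>> ast.dump(ast.parse('x = 1  # this comment disappears'))
"Module(body=[Assign(targets=[Name(id='x', ctx=Store())], value=Num(n=1))])"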
Note that Python source code goes through 3 steps:

1. Tokenizing
2. Parsing
3. Compilation

The grammar only applies to the parsing stage; comments are dropped in the tokenizer, and docstrings are only special to the compiler.
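To make those three steps concrete, here is a small sketch that pushes a toy function through the pure-Python tokenize module, the parser (via ast) and the compiler. The '<example>' filename is arbitrary, and this only illustrates the pipeline; it is not the exact code path CPython itself takes, whose tokenizer and parser are written in C:

import ast
import dis
import io
import tokenize

source = 'def foo():\n    "docstring"\n'

# Step 1: tokenizing - the source text becomes a stream of tokens.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Step 2: parsing - the token stream becomes a tree.
tree = ast.parse(source)

# Step 3: compilation - the tree becomes a code object containing bytecode.
code = compile(tree, '<example>', 'exec')
dis.dis(code)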
To illustrate that the parser doesn't treat docstrings as anything other than a string literal expression, you can inspect the parse result of any Python code as an Abstract Syntax Tree, via the ast module. This produces Python objects that directly reflect the parse tree the Python grammar parser produces, from which Python bytecode is then compiled:
>>> import ast
>>> function = 'def foo():\n "docstring"\n'
>>> parse_tree = ast.parse(function)
>>> ast.dump(parse_tree)
"Module(body=[FunctionDef(name='foo', args=arguments(args=[], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]), body=[Expr(value=Str(s='docstring'))], decorator_list=[], returns=None)])"
>>> parse_tree.body[0]
<_ast.FunctionDef object at 0x107b96ba8>
>>> parse_tree.body[0].body[0]
<_ast.Expr object at 0x107b16a20>
>>> parse_tree.body[0].body[0].value
<_ast.Str object at 0x107bb3ef0>
>>> parse_tree.body[0].body[0].value.s
'docstring'
So you have a FunctionDef object, which has, as the first element of its body, an expression that is a Str with the value 'docstring'. It is the compiler that then generates a code object, storing that docstring in a separate attribute.
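Incidentally, the ast module also provides a helper, ast.get_docstring(), which applies this same "first statement must be a string expression" rule at the tree level:

>>> ast.get_docstring(parse_tree.body[0])
'docstring'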
You can compile the AST into bytecode with the compile() function; again, this uses the actual code paths the Python interpreter uses. We'll use the dis module to disassemble the bytecode for us:
>>> codeobj = compile(parse_tree, '', 'exec')
>>> import dis
>>> dis.dis(codeobj)
1 0 LOAD_CONST 0 (<code object foo at 0x107ac9d20, file "", line 1>)
2 LOAD_CONST 1 ('foo')
4 MAKE_FUNCTION 0
6 STORE_NAME 0 (foo)
8 LOAD_CONST 2 (None)
10 RETURN_VALUE
So the compiled code produced the top-level statements for a module. The MAKE_FUNCTION opcode uses a stored code object (part of the constants of the top-level code object) to build a function. So we look at that nested code object, at index 0:
>>> dis.dis(codeobj.co_consts[0])
1 0 LOAD_CONST 1 (None)
2 RETURN_VALUE
Here the docstring appears to be gone. The function does nothing more than return None. The docstring is instead stored as a constant:
>>> codeobj.co_consts[0].co_consts
('docstring', None)
When executing the MAKE_FUNCTION opcode, it is that first constant, provided it is a string, that is turned into the __doc__ attribute for the function object.
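You can see the same layout on a function defined the normal way; on Python 3.6 (the version discussed here) the docstring is the first entry in co_consts, and it also shows up as __doc__ (the name bar is made up for this example):

>>> def bar():
...     "another docstring"
...
>>> bar.__code__.co_consts
('another docstring', None)
>>> bar.__doc__
'another docstring'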
Once compiled, we can execute the code object with the exec() function into a given namespace, which adds a function object with a docstring:
>>> namespace = {}
>>> exec(codeobj, namespace)
>>> namespace['foo']
<function foo at 0x107c23e18>
>>> namespace['foo'].__doc__
'docstring'
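And that __doc__ attribute is all that help() needs; under the hood it goes through pydoc and inspect.getdoc(), which (among other things) simply read the docstring back off the object:

>>> import inspect
>>> inspect.getdoc(namespace['foo'])
'docstring'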
So it's the job of the compiler to determine when something is a docstring. This is done in C code, in the compiler_isdocstring() function:
static int
compiler_isdocstring(stmt_ty s)
{
    if (s->kind != Expr_kind)
        return 0;
    if (s->v.Expr.value->kind == Str_kind)
        return 1;
    if (s->v.Expr.value->kind == Constant_kind)
        return PyUnicode_CheckExact(s->v.Expr.value->v.Constant.value);
    return 0;
}
This is called from the locations where a docstring makes sense: for modules and classes in compiler_body(), and for functions in compiler_function().
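Purely for illustration, here is a rough Python transliteration of that check, operating on ast nodes instead of the C AST structs (looks_like_docstring is a made-up name; the real check of course lives in the C compiler):

import ast

def looks_like_docstring(stmt):
    # A docstring candidate must be a bare expression statement...
    if not isinstance(stmt, ast.Expr):
        return False
    value = stmt.value
    # ...whose value is a plain string literal. Python 3.8+ parses string
    # literals as ast.Constant; 3.6/3.7 (discussed here) produce ast.Str.
    if isinstance(value, ast.Constant):
        return isinstance(value.value, str)
    return isinstance(value, getattr(ast, 'Str', ()))  # () makes this False if ast.Str is gone

tree = ast.parse('def foo():\n    "docstring"\n')
print(looks_like_docstring(tree.body[0].body[0]))  # True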
TLDR: comments are not part of the grammar, because the grammar parser never even sees comments. They are skipped by the tokenizer. Docstrings are not part of the grammar, because to the grammar parser they are just string literals. It is the compilation step (taking the parse tree output of the parser) that interprets those string expressions as docstrings.
* The full grammar rule path is simple_stmt -> small_stmt -> expr_stmt -> testlist_star_expr -> star_expr -> expr -> xor_expr -> and_expr -> shift_expr -> arith_expr -> term -> factor -> power -> atom_expr -> atom -> STRING+