Why does Python's grammar specification not include docstrings and comments?

I am consulting the official Python grammar specification as of Python 3.6.

I am unable to find any syntax for comments (they begin with a #) or for docstrings (they are enclosed in '''). A quick look at the lexical analysis page didn't help either: docstrings are defined there as longstrings, but they do not appear in the grammar specification. A token named STRING appears further on, but no reference to its definition is given.

Given this, I am curious about how the CPython compiler knows what comments and docstrings are. How is this feat accomplished?

I originally guessed that comments and docstrings are removed in a first pass by the CPython compiler, but then that raises the question of how help() is able to render the relevant docstrings.

Asked Dec 06 '22 by Akshat Mahajan

1 Answer

A docstring is not a separate grammar entity. It is just a regular simple_stmt (following that rule all the way down to atom and STRING+)*. If it is the first statement in a function body, class body, or module, then it is used as the docstring by the compiler.

This is documented in the reference documentation as footnotes to the class and def compound statements:

[3] A string literal appearing as the first statement in the function body is transformed into the function’s __doc__ attribute and therefore the function’s docstring.

[4] A string literal appearing as the first statement in the class body is transformed into the namespace’s __doc__ item and therefore the class’s docstring.

There is currently no reference documentation that specifies the same for modules; I regard this as a documentation bug.
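
The rule is easy to verify from the REPL; the outputs below are exactly what Python prints:

>>> def f():
...     "function docstring"
...
>>> f.__doc__
'function docstring'
>>> class C:
...     "class docstring"
...
>>> C.__doc__
'class docstring'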

Comments are removed by the tokenizer and never need to be parsed as grammar. Their whole point is to not have meaning on a grammar level. See the Comments section of the Lexical Analysis documentation:

A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules are invoked. Comments are ignored by the syntax; they are not tokens.

Note the final sentence: comments never become tokens at all. The tokenizer simply skips them:

/* Skip comment */
if (c == '#') {
    while (c != EOF && c != '\n') {
        c = tok_nextc(tok);
    }
}
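
You can see the effect from Python itself using the ast module (covered in more detail below): parse a line containing a comment, and the comment never shows up in the resulting tree. The output is shown as on Python 3.6; the exact node names vary across versions:

>>> import ast
>>> ast.dump(ast.parse('x = 1  # a comment'))
"Module(body=[Assign(targets=[Name(id='x', ctx=Store())], value=Num(n=1))])"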

Note that Python source code goes through 3 steps:

  1. Tokenizing
  2. Parsing
  3. Compilation

The grammar only applies to the parsing stage; comments are dropped in the tokenizer, and docstrings are only special to the compiler.

To illustrate how the parser doesn't treat docstrings as anything other than a string literal expression, you can inspect the result of parsing any Python source as an Abstract Syntax Tree via the ast module. This produces Python objects that directly reflect the parse tree the Python grammar parser produces, and from which Python bytecode is then compiled:

>>> import ast
>>> function = 'def foo():\n    "docstring"\n'
>>> parse_tree = ast.parse(function)
>>> ast.dump(parse_tree)
"Module(body=[FunctionDef(name='foo', args=arguments(args=[], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]), body=[Expr(value=Str(s='docstring'))], decorator_list=[], returns=None)])"
>>> parse_tree.body[0]
<_ast.FunctionDef object at 0x107b96ba8>
>>> parse_tree.body[0].body[0]
<_ast.Expr object at 0x107b16a20>
>>> parse_tree.body[0].body[0].value
<_ast.Str object at 0x107bb3ef0>
>>> parse_tree.body[0].body[0].value.s
'docstring'

So you have a FunctionDef object, which has, as the first element in its body, an expression that is a Str with the value 'docstring'. It is the compiler that then generates a code object, storing that docstring in a separate attribute.

You can compile the AST into bytecode with the compile() function; again, this uses the actual code paths the Python interpreter uses. We'll use the dis module to disassemble the bytecode for us:

>>> codeobj = compile(parse_tree, '', 'exec')
>>> import dis
>>> dis.dis(codeobj)
  1           0 LOAD_CONST               0 (<code object foo at 0x107ac9d20, file "", line 1>)
              2 LOAD_CONST               1 ('foo')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (foo)
              8 LOAD_CONST               2 (None)
             10 RETURN_VALUE

So the compiled code contains the top-level statements for a module. The MAKE_FUNCTION opcode uses a stored code object (part of the top-level code object's constants) to build a function. So we look at that nested code object, at index 0:

>>> dis.dis(codeobj.co_consts[0])
  1           0 LOAD_CONST               1 (None)
              2 RETURN_VALUE

Here the docstring appears to be gone. The function does nothing more than return None. The docstring is instead stored as a constant:

>>> codeobj.co_consts[0].co_consts
('docstring', None)

When executing the MAKE_FUNCTION opcode, it is that first constant, provided it is a string, that is turned into the __doc__ attribute for the function object.
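
The "provided it is a string" part matters; a first statement that is some other literal is not stored. A quick check:

>>> def nodoc():
...     42
...
>>> nodoc.__doc__ is None
True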

Once compiled, we can execute the code object with the exec() function in a given namespace; this adds a function object, complete with docstring, to that namespace:

>>> namespace = {}
>>> exec(codeobj, namespace)
>>> namespace['foo']
<function foo at 0x107c23e18>
>>> namespace['foo'].__doc__
'docstring'
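
This also covers the help() part of the question: help() (through the pydoc module) does nothing more than read that __doc__ attribute back. pydoc.getdoc() shows the raw lookup:

>>> import pydoc
>>> pydoc.getdoc(namespace['foo'])
'docstring'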

So it's the job of the compiler to determine when something is a docstring. This is done in C code, in the compiler_isdocstring() function:

static int
compiler_isdocstring(stmt_ty s)
{
    /* only a bare expression statement can be a docstring... */
    if (s->kind != Expr_kind)
        return 0;
    /* ...and only if that expression is a string literal */
    if (s->v.Expr.value->kind == Str_kind)
        return 1;
    /* or a constant node holding a str object */
    if (s->v.Expr.value->kind == Constant_kind)
        return PyUnicode_CheckExact(s->v.Expr.value->v.Constant.value);
    return 0;
}

This is called from locations where a docstring makes sense; for modules and classes, in compiler_body(), and for functions, in compiler_function().
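
In Python terms, the same check could be sketched with the ast module like this (a hypothetical is_docstring() helper, not part of CPython; string literals are ast.Str nodes on Python 3.6, ast.Constant on later versions):

import ast

def is_docstring(stmt):
    # A docstring candidate is a bare expression statement
    # whose value is a string literal.
    return isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Str)

>>> is_docstring(ast.parse('"just a string"').body[0])
True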


TL;DR: comments are not part of the grammar because the grammar parser never even sees them; they are skipped by the tokenizer. Docstrings are not part of the grammar because, to the grammar parser, they are just string literals. It is the compilation step (taking the parse tree output of the parser) that interprets those string expressions as docstrings and stores them in __doc__, which is all that help() ever reads.


* The full grammar rule path is simple_stmt -> small_stmt -> expr_stmt -> testlist_star_expr -> star_expr -> expr -> xor_expr -> and_expr -> shift_expr -> arith_expr -> term -> factor -> power -> atom_expr -> atom -> STRING+

Answered Dec 21 '22 by Martijn Pieters