Python lexical analysis - logical line & compound statements

So I understand that:

The end of a logical line is represented by the token NEWLINE

This means that, the way Python's grammar is defined, the only way to end a logical line is with a NEWLINE token.

The same goes for physical lines, which end with an EOL sequence: whichever one your platform uses when the file is written, although Python converts it to a universal \n anyway.

A logical line may or may not correspond to exactly one physical line. Usually it does, and almost always so if you write clean code.

In the sense that:

foo = 'some_value'  # 1 logical line = 1 physical  
foo, bar, baz = 'their', 'corresponding', 'values'  # 1 logical line = 1 physical
some_var, another_var = 10, 10; print(some_var, another_var); some_fn_call()

# the above is still 1 logical line = 1 physical line
# because ; is not a terminator per se but a delimiter
# since Python doesn't use EBNF exactly but rather a modified form of BNF

# p.s. one should never write code like the last line, it's just for educational purposes
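As a quick sanity check, here is a minimal sketch using the standard tokenize module to count the NEWLINE tokens in each of the three lines above. The helper name newline_tokens is made up for this illustration; it is not part of any library.

```python
import io
import tokenize

def newline_tokens(source):
    """Return how many NEWLINE tokens (logical-line ends) the tokenizer emits."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)

lines = [
    "foo = 'some_value'\n",
    "foo, bar, baz = 'their', 'corresponding', 'values'\n",
    "some_var, another_var = 10, 10; print(some_var, another_var); some_fn_call()\n",
]

for line in lines:
    # Tokenizing never runs the code, so some_fn_call does not need to exist.
    print(newline_tokens(line), line.strip())
```

Each line should report a single NEWLINE token, including the one that chains three simple statements with semicolons.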

Leaving aside examples of how 1 logical line can span more than 1 physical line, my question is about the following part of the docs:

Statements cannot cross logical line boundaries except where NEWLINE is allowed by the syntax (e.g., between statements in compound statements)

But what does this even mean? I understand the list of compound statements (if, while, for, etc.): each is made up of one or more clauses, and each clause, in turn, is made up of a header and a suite. The suite is made up of one or more statements. Let's take an example to be more specific:

So the if statement is something like this according to the grammar (excluding the elifs and else clauses):

if_stmt ::=  "if" expression ":" suite

where the suite and its subsequent statements:

suite         ::=  stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
statement     ::=  stmt_list NEWLINE | compound_stmt
stmt_list     ::=  simple_stmt (";" simple_stmt)* [";"]

So this means you can write your suite in one of two ways (the alternatives are separated by "|"):

  1. on the same line:

    disadvantages: not Pythonic, and the suite cannot contain another compound statement that introduces a new block (like a function definition, another if, etc.)

    advantages: it's a one-liner, I guess

example:

if 'truthy_string': foo, bar, baz = 1, 2, 3; print('whatever'); call_some_fn();

  2. introduce a new block:

    advantages: all, and the proper way to do it

example:

if 'truthy_value':
    first_stmt = 5
    second_stmt = 10
    a, b, c = 1, 2, 3
    func_call()
    result = inception(nested(calls(one_param), another_param), yet_another)
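A small sketch, assuming only the standard ast module, suggests that the two suite forms are just two spellings of the same statement. The strings one_line and block are shortened stand-ins for the examples above, not the exact code.

```python
import ast

# Same-line suite: stmt_list NEWLINE
one_line = "if 'truthy_string': foo = 1; bar = 2\n"

# Block suite: NEWLINE INDENT statement+ DEDENT
block = "if 'truthy_string':\n    foo = 1\n    bar = 2\n"

# ast.dump omits line/column attributes by default, so only structure is compared.
same_tree = ast.dump(ast.parse(one_line)) == ast.dump(ast.parse(block))
print(same_tree)  # True
```

Either way the parser ends up with one if statement whose body holds two assignments.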

But I don't see how this applies here:

Statements cannot cross logical line boundaries except where NEWLINE is allowed by the syntax

What I see above is a suite: a block of code controlled by the if clause. That suite, in turn, is made up of independent logical lines (statements), each of which happens to be one physical line. I don't see how one logical line can cross the boundaries (a boundary being just a fancy word for the end, the limit, which is the newline). That is, I don't see how one statement can cross those boundaries and span into the next statement. Maybe I'm really confused and have everything mixed up, so I'd appreciate it if someone could explain.

Thank you for your time in advance.

Marius Mucenicu asked Mar 28 '18 08:03


1 Answer

Python's grammar

Fortunately there is a Full Grammar specification in the Python documentation.

A statement is defined in that specification as:

stmt: simple_stmt | compound_stmt

And a logical line is delimited by NEWLINE (that's not stated in the grammar specification itself, but the lexical analysis chapter of the reference, which your question quotes, says so).

Step-by-step

Okay, let's go through this. What's the specification for a simple_stmt?

simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
             import_stmt | global_stmt | nonlocal_stmt | assert_stmt)

Now it branches into several different paths, and it probably doesn't make sense to go through all of them separately. Based on the specification, though, a simple_stmt could cross logical line boundaries only if one of its small_stmts contained a NEWLINE (currently none of them do, but they could).
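To see that in practice, here is a minimal sketch using the built-in compile function. The helper name compiles is made up for this illustration.

```python
def compiles(source):
    """Return True if the compiler accepts source, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

# The line break after '+' produces a NEWLINE token in the middle of the statement.
print(compiles("x = 1 +\n2\n"))      # False

# Inside parentheses the line break does not produce a NEWLINE token.
print(compiles("x = (1 +\n2)\n"))    # True

# A backslash joins the two physical lines into one logical line.
print(compiles("x = 1 + \\\n2\n"))   # True
```

So a simple statement stays within a single logical line, even when that logical line spans several physical lines.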

Apart from that purely theoretical possibility, there is actually the compound_stmt:

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
[...]
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
[...]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

I picked only the if statement and suite because it already suffices. The if statement including elif and else and all of the content in these is one statement (a compound statement). And because it may contain NEWLINEs (if the suite isn't just a simple_stmt) it already fulfills the requirement of "a statement that crosses logical line boundaries".
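One way to check this, sketched here with the standard ast module (the end_lineno attribute assumes Python 3.8 or later), is to ask the parser where the if statement starts and ends. The trailing x = 0 line is added only to show where the next statement begins.

```python
import ast

source = "if 1:\n    100\n    200\nx = 0\n"
tree = ast.parse(source)

# The module body holds two statements: the if and the assignment.
if_node = tree.body[0]

# A single If node covers physical lines 1 through 3.
print(type(if_node).__name__, if_node.lineno, if_node.end_lineno)  # If 1 3

# The statements of the suite are children of that node.
print([(type(s).__name__, s.lineno) for s in if_node.body])  # [('Expr', 2), ('Expr', 3)]
```

The two expression statements each sit on their own logical line, yet both lie inside the span of the one if statement.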

An example if (schematic):

if 1:
    100
    200

would be:

if_stmt
|---> test        --> 1
|---> NEWLINE
|---> INDENT
|---> expr_stmt   --> 100
|---> NEWLINE
|---> expr_stmt   --> 200
|---> NEWLINE
|---> DEDENT

And all of this belongs to the if statement (and it's not just a block "controlled" by the if or while, ...).
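The token sequence in that schematic can be reproduced with the standard tokenize module. This is only a sketch: note that tokenize reports the colon under the generic OP type, so the names will not match the grammar's token names exactly.

```python
import io
import tokenize

source = "if 1:\n    100\n    200\n"

# Collect the generic type name of every token the tokenizer emits.
names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
print(names)
```

The printed list should contain three NEWLINE tokens plus one INDENT and one DEDENT, all of them consumed by the single if statement.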

The same if with parser, symbol and token

A way to visualize that would be to use the built-in parser, token and symbol modules (really, I hadn't known about these modules before I wrote this answer):

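# Note: the parser and symbol modules were removed in Python 3.10,
# so this example needs Python 3.9 or earlier.
import symbol
import parser
import token

s = """
if 1:
    100
    200
"""
st = parser.suite(s)

def recursive_print(inp, level=0):
    """Print the nested parse-tree list, one dot per nesting level."""
    for item in inp:
        if isinstance(item, int):
            # Map the numeric id to a grammar symbol name or a token name.
            print('.'*level, symbol.sym_name.get(item, token.tok_name.get(item, item)), sep="")
        elif isinstance(item, list):
            recursive_print(item, level+1)
        else:
            print('.'*level, repr(item), sep="")

recursive_print(st.tolist())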

Actually, I cannot explain most of the parser result, but it shows (if you remove a lot of unnecessary lines) that the suite, including its newlines, really belongs to the if_stmt. Indentation represents the "depth" of the parser at a specific point.

file_input
.stmt
..compound_stmt
...if_stmt
....NAME
....'if'
....test
.........expr
...................NUMBER
...................'1'
....COLON
....suite
.....NEWLINE
.....INDENT
.....stmt
...............expr
.........................NUMBER
.........................'100'
.......NEWLINE
.....stmt
...............expr
.........................NUMBER
.........................'200'
.......NEWLINE
.....DEDENT
.NEWLINE
.ENDMARKER

That could probably be made much more beautiful, but I hope it serves as an illustration even in its current form.
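On Python 3.10 and later the parser and symbol modules are no longer available. A rough substitute, sketched here with the standard ast module (the indent argument assumes Python 3.9 or later), prints the abstract syntax tree instead. It does not show NEWLINE, INDENT or DEDENT tokens, but it still shows both expression statements nested inside the one If node.

```python
import ast

source = "if 1:\n    100\n    200\n"
tree = ast.parse(source)

# Pretty-print the abstract syntax tree; the suite appears as the If node's body.
print(ast.dump(tree, indent=2))
```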

MSeifert answered Nov 02 '22 06:11