Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

replace semicolon by newline in python code

I would like to parse Python code that contains semicolons ; for separating commands and produce code that replaces those by newlines \n. E.g., from

def main():
    a = "a;b"; return a

I'd like to produce

def main():
    a = "a;b"
    return a

Any hints?

like image 230
Nico Schlömer Avatar asked Jun 13 '16 11:06

Nico Schlömer


1 Answers

Use the tokenize library to look for token.OP tokens, where the second element is a ; *. Replace these tokens with a token.NEWLINE token.

You'd need to adjust your token offsets and generate matching indent too however; so after a NEWLINE you'd need to adjust line numbers (increment by an offset you increase for every NEWLINE you insert) and the 'next' line (remainder of the current line) would have to have the indices adjusted to match the current indentation level:

import tokenize

TokenInfo = getattr(tokenize, 'TokenInfo', lambda *a: a)  # Python 3 compat

def semicolon_to_newline(tokens):
    line_offset = 0
    last_indent = None
    col_offset = None  # None or an integer
    for ttype, tstr, (slno, scol), (elno, ecol), line in tokens:
        slno, elno = slno + line_offset, elno + line_offset
        if ttype in (tokenize.INDENT, tokenize.DEDENT):
            last_indent = ecol  # block is indented to this column
        elif ttype == tokenize.OP and tstr == ';':
            # swap out semicolon with a newline
            ttype = tokenize.NEWLINE
            tstr = '\n'
            line_offset += 1
            if col_offset is not None:
                scol, ecol = scol - col_offset, ecol - col_offset
            col_offset = 0  # next tokens should start at the current indent
        elif col_offset is not None:
            if not col_offset:
                # adjust column by starting column of next token
                col_offset = scol - last_indent
            scol, ecol = scol - col_offset, ecol - col_offset
            if ttype == tokenize.NEWLINE:
                col_offset = None
        yield TokenInfo(
            ttype, tstr, (slno, scol), (elno, ecol), line)

with open(sourcefile, 'r') as source, open(destination, 'w') as dest:
    generator = tokenize.generate_tokens(source.readline)
    dest.write(tokenize.untokenize(semicolon_to_newline(generator)))

Note that I don't bother to correct the line value; it is informative only, the data that was read from the file is not actually used when un-tokenizing.

Demo:

>>> from io import StringIO
>>> source = StringIO('''\
... def main():
...     a = "a;b"; return a
... ''')
>>> generator = tokenize.generate_tokens(source.readline)
>>> result = tokenize.untokenize(semicolon_to_newline(generator))
>>> print(result)
def main():
    a = "a;b"
    return a

and slightly more complex:

>>> source = StringIO('''\
... class Foo(object):
...     def bar(self):
...         a = 10; b = 11; c = 12
...         if self.spam:
...             x = 12; return x
...         x = 15; return y
...
...     def baz(self):
...         return self.bar;
...         # note, nothing after the semicolon
... ''')
>>> generator = tokenize.generate_tokens(source.readline)
>>> result = tokenize.untokenize(semicolon_to_newline(generator))
>>> print(result)
class Foo(object):
    def bar(self):
        a = 10
        b = 11
        c = 12
        if self.spam:
            x = 12
            return x
        x = 15
        return y

    def baz(self):
        return self.bar

        # note, nothing after the semicolon

>>> print(result.replace(' ', '.'))
class.Foo(object):
....def.bar(self):
........a.=.10
........b.=.11
........c.=.12
........if.self.spam:
............x.=.12
............return.x
........x.=.15
........return.y

....def.baz(self):
........return.self.bar
........
........#.note,.nothing.after.the.semicolon

* The Python 3 version of tokenize outputs more informative TokenInfo named tuples, which have an extra exact_type attribute that can be used instead of doing a text match: tok.exact_type == tokenize.SEMI. I kept the above compatible with Python 2 and 3 however.

like image 118
Martijn Pieters Avatar answered Oct 18 '22 00:10

Martijn Pieters