
Extracting comments from Python Source Code

I'm trying to write a program that extracts the comments from code a user enters. I tried using a regex, but found it difficult to write.

Then I found a post here. The answer suggests using tokenize.generate_tokens to analyze the grammar, but the documentation says:

The generate_tokens() generator requires one argument, readline, which must be a callable object which provides the same interface as the readline() method of built-in file objects (see section File Objects).

But a string object does not have a readline method.

Then I found another post here, which suggests using StringIO.StringIO to get a readline method. So I wrote the following code:

import tokenize
import io
import StringIO

def extract(code):
    res = []
    comment = None
    stringio = StringIO.StringIO(code)
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):
        # print(toknum,tokval)
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            print tokenize.untokenize(toktype)
    return tokenize.untokenize(res)

And entered the following code: extract('a = 1+2#A Comment')

But got:

Traceback (most recent call last):     
   File "<stdin>", line 1, in <module>     
   File "ext.py", line 10, in extract     
     for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):     
   File "C:\Python27\lib\tokenize.py", line 294, in generate_tokens     
     line = readline()     
AttributeError: StringIO instance has no `__call__` method

I know I can write a new class, but is there any better solution?

LouYu asked Dec 29 '15 13:12




2 Answers

Answer for more general cases (extracting from modules, functions):

Modules:

The documentation specifies that one needs to provide a callable which exposes the same interface as the readline() method of built-in file objects. This is a hint: create an object that provides that method.

In the case of a module, we can just open the module as a normal file and pass in its readline method. This is the key: the argument you pass is the readline method itself, not the object and not the result of calling it.

Given a small scrpt.py file with:

# My amazing foo function.
def foo():
    """ docstring """
    # I will print
    print "Hello"
    return 0   # Return the value

# Maaaaaaain
if __name__ == "__main__":
    # this is main
    print "Main" 

We will open it as we do all files:

fileObj = open('scrpt.py', 'r')

This file object now has a method called readline (because it is a file object), which we can safely pass to tokenize.generate_tokens to create a generator.

tokenize.generate_tokens yields 5-tuples describing each token: its type, its string, its start and end positions, and the line it came from. (In Python 3 these are named tuples. Python 3's tokenize.tokenize requires readline to return bytes, so there you'll need to open the file in 'rb' mode; tokenize.generate_tokens still accepts a readline that returns str.) Here's a small demo:

for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    # we can also use token.tok_name[toktype] instead of 'COMMENT'
    # from the token module 
    if toktype == tokenize.COMMENT:
        print 'COMMENT' + " " + tok

Notice how we pass the fileObj.readline method to it. This will now print:

COMMENT # My amazing foo function.
COMMENT # I will print
COMMENT # Return the value
COMMENT # Maaaaaaain
COMMENT # this is main

So all comments regardless of position are detected. Docstrings of course are excluded.
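The snippets above are Python 2. As a minimal Python 3 sketch of the same file-based approach, assuming tokenize.tokenize with the file opened in 'rb' mode, the helper below collects the comments into a list. The name comments_in_file and the temporary demo file are hypothetical and only there to make the example self-contained.

```python
import os
import tempfile
import tokenize


def comments_in_file(path):
    """Return the text of every comment token in the Python file at `path`."""
    # tokenize.tokenize() expects readline to return bytes, hence 'rb'.
    with open(path, "rb") as source_file:
        return [
            tok.string
            for tok in tokenize.tokenize(source_file.readline)
            if tok.type == tokenize.COMMENT
        ]


# Hypothetical demo file, written to a temporary location.
demo_source = (
    "# My amazing foo function.\n"
    "def foo():\n"
    '    """ docstring """\n'
    "    # I will print\n"
    "    return 0   # Return the value\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write(demo_source)

demo_comments = comments_in_file(tmp.name)
os.remove(tmp.name)
print(demo_comments)
```

As in the Python 2 version, the docstring is tokenized as a string, so it does not show up among the comments.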

Functions:

You could achieve a similar result without open, though I can't really think of a case that needs it. Nonetheless, I'll present another way of doing it for completeness' sake. In this scenario you'll need two additional modules, inspect and StringIO (io.StringIO in Python 3):

Let's say you have the following function:

def bar():
    # I am bar
    print "I really am bar"
    # bar bar bar baaaar
    # (bar)
    return "Bar"

You need a file-like object which has a readline method to use it with tokenize. Well, you can create a file-like object from a str using StringIO.StringIO, and you can get a str representing the source of the function with inspect.getsource(func). In code:

funcText = inspect.getsource(bar)
funcFile = StringIO.StringIO(funcText)

Now we have a file-like object representing the function which has the wanted readline method. We can just re-use the loop we previously performed replacing fileObj.readline with funcFile.readline. The output we get now is of similar nature:

COMMENT # I am bar
COMMENT # bar bar bar baaaar
COMMENT # (bar)
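For Python 3, the same idea might look roughly like the sketch below, assuming io.StringIO in place of StringIO.StringIO and tokenize.generate_tokens, which accepts a readline that returns str. The helper names comments_in_source and comments_in_function are hypothetical. Note that inspect.getsource only works when the function's source file is available, so the demo here runs on a source string instead.

```python
import inspect
import io
import textwrap
import tokenize


def comments_in_source(source):
    """Return the comment tokens found in a string of Python source."""
    # io.StringIO gives the string a file-like readline method.
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    ]


def comments_in_function(func):
    """Return the comment tokens found in the source of `func`."""
    # Dedent so that methods and nested functions start at column 0.
    return comments_in_source(textwrap.dedent(inspect.getsource(func)))


# Source text of the bar function from above, written as a string.
bar_source = (
    "def bar():\n"
    "    # I am bar\n"
    "    print('I really am bar')\n"
    "    # bar bar bar baaaar\n"
    "    # (bar)\n"
    "    return 'Bar'\n"
)
bar_comments = comments_in_source(bar_source)
print(bar_comments)
```

For a function defined in a file on disk, comments_in_function(bar) would go through inspect.getsource and return the same kind of list.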

As an aside, if you really want to create a custom way of doing this with re, take a look at the source for the tokenize.py module. It defines certain patterns for comments (r'#[^\r\n]*'), names, et cetera, loops through the lines with readline, and searches each line for those patterns. Thankfully, it's not too complex after you look at it for a while :-).
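One caveat if you go the re route: applying the comment pattern on its own to raw lines misfires when a # appears inside a string literal, which is why the tokenizer also has to recognize strings. The Python 3 sketch below illustrates the difference; the sample line is made up for the demonstration.

```python
import io
import re
import tokenize

# The comment pattern quoted above, applied naively to a raw line.
COMMENT_PATTERN = re.compile(r"#[^\r\n]*")

line = 'url = "http://example.com/#anchor"  # real comment\n'

# The regex starts at the first '#', which is inside the string literal.
regex_hits = COMMENT_PATTERN.findall(line)

# The tokenizer consumes the string literal as one token first.
token_hits = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(line).readline)
    if tok.type == tokenize.COMMENT
]

print(regex_hits)
print(token_hits)
```

Here the regex returns everything from #anchor onward, while the tokenizer reports only the real comment.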


Answer for function extract (Update):

You've created an object with StringIO that provides the interface, but you haven't passed that interface (readline) to tokenize.generate_tokens; instead, you passed the full object (stringio).

Additionally, in your else clause a TypeError is going to be raised, because untokenize expects an iterable of tokens as input and you pass it a bare token type. Making the following changes, your function works fine:

import tokenize
import StringIO

def extract(code):
    res = []
    stringio = StringIO.StringIO(code)
    # pass in stringio.readline to generate_tokens
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio.readline):
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            # wrap the (toktype, tokval) tuple in a list
            print tokenize.untokenize([(toktype, tokval)])
    return tokenize.untokenize(res)

Supplied with input of the form expr = extract('a=1+2#A comment') the function will print out the comment and retain the expression in expr:

expr = extract('a=1+2#A comment')
#A comment

print repr(expr)
'a =1 +2 '

Furthermore, as mentioned earlier, in Python 3 StringIO lives in the io module (io.StringIO), so in that case the separate import StringIO is thankfully not required.
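Putting this together for Python 3, a minimal sketch of extract could look like the following. It assumes io.StringIO and tokenize.generate_tokens, and, unlike the function above, it returns the comments in a list instead of printing them; that return shape is a choice made for this example, not part of the original code.

```python
import io
import tokenize


def extract(code):
    """Split `code` into (code_without_comments, comments)."""
    kept, comments = [], []
    # Pass the readline method, not the StringIO object itself.
    readline = io.StringIO(code).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string)
        else:
            kept.append((tok.type, tok.string))
    # With (type, string) pairs, untokenize() regenerates the spacing,
    # so the result may not match the input character for character.
    return tokenize.untokenize(kept), comments


stripped, comments = extract("a = 1+2#A Comment")
print(comments)
print(repr(stripped))
```

The stripped code is still valid Python, but its whitespace is rebuilt by untokenize, just like the 'a =1 +2 ' result shown above.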

Dimitris Fasarakis Hilliard answered Oct 13 '22 02:10


Use this Third-Party Library from PyPI

Comment Parser

Shedrack answered Oct 13 '22 04:10