I'm trying to write a program that extracts the comments from code the user enters. I tried to use regex, but found it difficult to write. Then I found a post here. The answer suggests using tokenize.generate_tokens to analyze the grammar, but the documentation says:

The generate_tokens() generator requires one argument, readline, which must be a callable object which provides the same interface as the readline() method of built-in file objects (see section File Objects).

But a string object does not have a readline method. Then I found another post here, suggesting to use StringIO.StringIO to get a readline method. So I wrote the following code:
import tokenize
import io
import StringIO

def extract(code):
    res = []
    comment = None
    stringio = StringIO.StringIO(code)
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):
        # print(toknum, tokval)
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            print tokenize.untokenize(toktype)
    return tokenize.untokenize(res)
And entered the following code: extract('a = 1+2#A Comment')
But got:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "ext.py", line 10, in extract
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):
  File "C:\Python27\lib\tokenize.py", line 294, in generate_tokens
    line = readline()
AttributeError: StringIO instance has no __call__ method
I know I can write a new class, but is there any better solution?
The documentation specifies that one needs to provide a callable which exposes the same interface as the readline() method of built-in file objects. That hints at the solution: get hold of an object that provides such a method. In the case of a module, we can just open it as a normal file and pass in its readline method. This is the key: the argument you pass is the method readline() itself.
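In Python 3 the same idea can be sketched without a file at all: io.StringIO wraps a string, and its bound readline method is exactly the kind of callable generate_tokens wants (the code below is my own illustration, not from the original answer):

```python
import io
import tokenize

code = "a = 1 + 2  # A comment\n"

# io.StringIO gives a file-like object over the string;
# its bound method `readline` is the callable tokenize expects
readline = io.StringIO(code).readline

for tok in tokenize.generate_tokens(readline):
    if tok.type == tokenize.COMMENT:
        print(tok.string)  # prints: # A comment
```

Note that it is `io.StringIO(code).readline` that gets passed, not the StringIO object itself.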
Given a small scrpt.py
file with:
# My amazing foo function.
def foo():
    """ docstring """
    # I will print
    print "Hello"
    return 0  # Return the value

# Maaaaaaain
if __name__ == "__main__":
    # this is main
    print "Main"
We will open it as we do all files:
fileObj = open('scrpt.py', 'r')
This file object now has a method called readline (because it is a file object), which we can safely pass to tokenize.generate_tokens to create a generator. tokenize.generate_tokens (simply tokenize.tokenize in Python 3 -- note that Python 3 requires readline to return bytes, so you'll need to open the file in 'rb' mode) yields 5-tuples (TokenInfo named tuples in Python 3) containing information about each token. Here's a small demo:
for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    # we can also use token.tok_name[toktype] instead of 'COMMENT'
    # from the token module
    if toktype == tokenize.COMMENT:
        print 'COMMENT' + " " + tok
Notice how we pass the fileObj.readline
method to it. This will now print:
COMMENT # My amazing foo function.
COMMENT # I will print
COMMENT # Return the value
COMMENT # Maaaaaaain
COMMENT # this is main
So all comments regardless of position are detected. Docstrings of course are excluded.
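Why are docstrings excluded? The tokenizer reports them as STRING tokens, not COMMENT tokens; only the compiler layer treats a leading string specially. In Python 3, where the yielded items are tokenize.TokenInfo named tuples, this is easy to check (a sketch with an example function of my own):

```python
import io
import tokenize

src = 'def f():\n    """docstring"""\n    return 0  # done\n'

for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    # TokenInfo fields: type, string, start, end, line
    if tok.type == tokenize.COMMENT:
        print("COMMENT", tok.string)   # only '# done'
    elif tok.type == tokenize.STRING:
        print("STRING", tok.string)    # the docstring arrives as a STRING token
```

So if you wanted docstrings too, you would filter on STRING tokens as well.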
You could achieve a similar result without open, for cases where the source isn't sitting in a file you can open (though I can't really think of many). Nonetheless, I'll present another way of doing it for completeness' sake. In this scenario you'll need two additional modules, inspect and StringIO (io.StringIO in Python 3):
Let's say you have the following function:
def bar():
    # I am bar
    print "I really am bar"
    # bar bar bar baaaar
    # (bar)
    return "Bar"
You need a file-like object with a readline method to use it with tokenize. Well, you can create a file-like object from a str using StringIO.StringIO, and you can get a str representing the source of the function with inspect.getsource(func). In code:
funcText = inspect.getsource(bar)
funcFile = StringIO.StringIO(funcText)
Now we have a file-like object representing the function, with the wanted readline method. We can just re-use the previous loop, replacing fileObj.readline with funcFile.readline. The output we get now is of a similar nature:
COMMENT # I am bar
COMMENT # bar bar bar baaaar
COMMENT # (bar)
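For reference, in Python 3 the same recipe uses io.StringIO. The sketch below grabs the source of a stdlib function rather than defining its own, because inspect.getsource needs access to the file the function was defined in (that choice is an assumption of this example, not part of the original answer):

```python
import inspect
import io
import tokenize

# any function whose defining file is available will do;
# here we read the source of tokenize.untokenize itself
funcText = inspect.getsource(tokenize.untokenize)
funcFile = io.StringIO(funcText)

# collect every comment token in the function's source
comments = [tok.string
            for tok in tokenize.generate_tokens(funcFile.readline)
            if tok.type == tokenize.COMMENT]
print(comments)
```

The list may well be empty if the chosen function has no comments; the point is only the getsource-to-StringIO plumbing.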
As an aside, if you really want to create a custom way of doing this with re, take a look at the source of the tokenize.py module. It defines certain patterns for comments (r'#[^\r\n]*'), names, et cetera, loops through the lines with readline, and searches within each line for those patterns. Thankfully, it's not too complex after you look at it for a while :-).
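As a rough illustration of that pattern (my own sketch, and a demonstration of why the tokenize approach is preferable: a bare regex also matches hashes inside string literals):

```python
import re

# the comment pattern used inside tokenize.py
COMMENT_RE = re.compile(r'#[^\r\n]*')

code = 'a = 1 + 2  # A comment\ns = "not # a comment"\n'
print(COMMENT_RE.findall(code))
# matches '# A comment' but also the hash inside the string
# literal, which the tokenize-based approach correctly ignores
```

Handling string literals correctly is exactly the bookkeeping tokenize already does for you.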
(Update): In your extract function you've created an object with StringIO that provides the interface, but you haven't passed that interface (readline) to tokenize.generate_tokens; instead, you passed the full object (stringio). Additionally, in your else clause a TypeError is going to be raised, because untokenize expects an iterable as input. With the following changes, your function works fine:
def extract(code):
    res = []
    comment = None
    stringio = StringIO.StringIO(code)
    # pass stringio.readline to generate_tokens
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio.readline):
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            # wrap the (toktype, tokval) tuple in a list
            print tokenize.untokenize([(toktype, tokval)])
    return tokenize.untokenize(res)
Supplied with input of the form expr = extract('a=1+2#A comment')
the function will print out the comment and retain the expression in expr
:
expr = extract('a=1+2#A comment')
#A comment
print expr
'a =1 +2 '
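For completeness, a Python 3 version of the fixed function could look like this (a sketch: io.StringIO replaces StringIO.StringIO, and print becomes a function):

```python
import io
import tokenize

def extract(code):
    """Print the comments in `code` and return it with comments stripped."""
    res = []
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    for toktype, tokval, begin, end, line in tokens:
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            # untokenize still wants an iterable of (type, string) pairs
            print(tokenize.untokenize([(toktype, tokval)]))
    return tokenize.untokenize(res)

expr = extract('a=1+2#A comment')  # prints '#A comment'
```

As in the Python 2 version, untokenize's two-tuple mode does not preserve the exact original spacing, so expr comes back with normalized whitespace.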
Furthermore, as mentioned earlier, io houses StringIO in Python 3, so in that case the StringIO import is thankfully not required.
Alternatively, use a third-party library from PyPI: comment_parser.