
How to tokenize Python code using the tokenize module?

Suppose I have a string that contains Python code.

input = "import nltk
 from nltk.stem import PorterStemmer
 porter_stemmer=PorterStemmer()
 words=["connect","connected","connection","connections","connects"]
 stemmed_words=[porter_stemmer.stem(word) for word in words]
 stemmed_words"

How can I tokenize the code? I found the tokenize module (https://docs.python.org/3/library/tokenize.html). However, it is not clear to me how to use the module. It has tokenize.tokenize(readline) but the parameter takes a generator, not a string.

Muhammad Asaduzzaman asked Nov 04 '25 16:11



1 Answer

import io
import tokenize

# Triple quotes let the string literal span multiple lines.
inp = """import nltk
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
words=["connect","connected","connection","connections","connects"]
stemmed_words=[porter_stemmer.stem(word) for word in words]
stemmed_words"""

# generate_tokens expects a readline callable that returns str.
for token in tokenize.generate_tokens(io.StringIO(inp).readline):
    print(token)

tokenize.tokenize takes a callable, not a string: it should be the readline method of an IO object. In addition, tokenize.tokenize expects that readline method to return bytes. If your source is a str, use tokenize.generate_tokens instead, which accepts a readline method that returns strings.
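To make the bytes-versus-strings difference concrete, here is a minimal sketch that runs both functions on the same source. The one-line sample and the variable names (source, byte_tokens, str_tokens) are illustrative and not taken from the question.

```python
import io
import tokenize

source = "x = 1 + 2\n"

# tokenize.tokenize needs a readline callable that returns bytes,
# so encode the string and wrap it in io.BytesIO.
byte_tokens = list(
    tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline)
)

# tokenize.generate_tokens accepts a readline callable that returns str.
str_tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# The bytes-based function emits an extra leading ENCODING token;
# the remaining tokens carry the same text as the str-based ones.
print(tokenize.tok_name[byte_tokens[0].type])
print([tok.string for tok in str_tokens])
```

If you already hold the code as a str, generate_tokens avoids the encode step and the extra ENCODING token.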

Your input should also be a triple-quoted string, because a string literal in single or double quotes cannot span multiple lines.

See io.TextIOBase, tokenize.generate_tokens for more info.
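Printing raw tokens shows numeric type codes, which can be hard to read. As a rough sketch, the helper below (tokenize_source is a hypothetical name, not part of the tokenize module) maps each code to its name through tokenize.tok_name and pairs it with the token text.

```python
import io
import tokenize


def tokenize_source(source):
    """Return (token type name, token text) pairs for a string of Python code."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


# Illustrative input, shortened from the list in the question.
pairs = tokenize_source('words = ["connect", "connected"]\n')
for name, text in pairs:
    print(f"{name:10} {text!r}")
```

Each token also carries start and end positions and the source line, so the same loop can be extended if you need locations.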

Minion3665 answered Nov 06 '25 09:11
