I am using NLTK and want to create my own custom texts, just like the default ones in nltk.book. However, so far I have only got as far as building the token list by hand:
my_text = ['This', 'is', 'my', 'text']
I'd like to find a way to input my text as a plain string instead:
my_text = "This is my text, this is a nice way to input text."
Which method, either built into Python or from NLTK, allows me to do this? And more importantly, how can I discard punctuation symbols?
Tokenization is the process of splitting a string of text into a list of tokens. A token is a unit of some larger whole: a word is a token in a sentence, and a sentence is a token in a paragraph.
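To make the two levels concrete, here is a minimal sketch using only Python's re module. The regular expressions are simplifying assumptions for illustration (they are not what NLTK uses internally) and will mis-split abbreviations such as "Dr." or contractions such as "didn't".

```python
import re

paragraph = "This is my text. This is a nice way to input text."

# Sentence-level tokens: split after '.', '!' or '?' when followed by whitespace.
# (Naive rule, assumed for illustration only.)
sentences = re.split(r"(?<=[.!?])\s+", paragraph)

# Word-level tokens: runs of word characters, or single punctuation marks.
words = [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences]

print(sentences)
print(words[0])
```

Note that punctuation marks come out as tokens of their own here, which is also how nltk.word_tokenize() behaves.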
Open the file with a context manager (with open(...) as x) and read it line by line with a for loop. Tokenize each line with word_tokenize(), then write the result in your desired format to an output file opened in write mode ('w').
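The steps above could be sketched roughly as follows. The names tokenize_file and simple_tokenize are hypothetical, and simple_tokenize is a regex stand-in so the example runs without NLTK installed; if NLTK and its tokenizer data are available, nltk.word_tokenize can be passed as the tokenize argument instead.

```python
import os
import re
import tempfile

def simple_tokenize(line):
    # Hypothetical stand-in tokenizer: words, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line)

def tokenize_file(src, dst, tokenize=simple_tokenize):
    # Read src line by line, tokenize each line, and write the tokens
    # space-separated, one input line per output line.
    with open(src, encoding="utf-8") as infile, \
         open(dst, "w", encoding="utf-8") as outfile:
        for line in infile:
            outfile.write(" ".join(tokenize(line)) + "\n")

# Small demonstration using temporary files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "tokens.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("This is my text,\nthis is a nice way to input text.\n")
    tokenize_file(src, dst)
    with open(dst, encoding="utf-8") as f:
        result = f.read()

print(result)
```

Writing the tokens space-separated is only one possible output format; a JSON list per line would work equally well if the tokens need to be read back unambiguously.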
For comparison, Java offers a similar facility in its StringTokenizer class. Its hasMoreTokens() method returns true if more tokens are available in the tokenizer's string and false otherwise, and it is used together with nextToken() to step through the tokens one at a time.
This is actually on the main page of nltk.org:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
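As for dismissing punctuation, one possible approach is to filter the token list afterwards. This is a minimal sketch, not the only way to do it: the token list below is copied from the output above so that the example runs without NLTK, and the filtering rule (drop any token made up entirely of punctuation characters) is an assumption about what "dismiss punctuation" should mean.

```python
import string

# Tokens as returned by nltk.word_tokenize() in the example above.
tokens = ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
          'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

# Keep a token unless every character in it is ASCII punctuation.
# "o'clock" and "n't" survive because they also contain letters.
words = [tok for tok in tokens
         if not all(ch in string.punctuation for ch in tok)]

print(words)
```

The resulting list can then be wrapped with nltk.Text(words), which is the same class the texts in nltk.book are instances of. An alternative is nltk.tokenize.RegexpTokenizer(r'\w+'), which never emits punctuation in the first place, though it would also split "o'clock" into two tokens.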