
How do I tokenize a string sentence in NLTK?

I am using NLTK and want to create my own custom texts, just like the default ones in nltk.book. However, so far I have only managed to build the token list by hand:

my_text = ['This', 'is', 'my', 'text'] 

I'd like to find a way to input my text as a plain string instead:

my_text = "This is my text, this is a nice way to input text." 

Which method, from Python itself or from NLTK, allows me to do this? And more importantly, how can I discard punctuation symbols?

diegoaguilar asked Feb 24 '13 23:02




1 Answer

This is actually on the main page of nltk.org:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
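
Note that word_tokenize keeps punctuation as separate tokens (the final '.' above). One possible way to handle the punctuation part of the question is a regular-expression tokenizer that only matches runs of word characters. The following is a minimal sketch, assuming that the pattern \w+ is an acceptable definition of a "word" for your text. The function name tokenize_without_punctuation is made up for illustration, and the fallback branch exists only so that the snippet also runs where NLTK is not installed.

```python
import re

try:
    # RegexpTokenizer is part of nltk.tokenize and needs no extra data downloads.
    from nltk.tokenize import RegexpTokenizer

    _word_tokenizer = RegexpTokenizer(r"\w+")

    def tokenize_without_punctuation(text):
        """Return the runs of word characters in text, dropping punctuation."""
        return _word_tokenizer.tokenize(text)

except ImportError:
    # Assumption: plain re.findall with the same pattern is a reasonable
    # stand-in when NLTK is unavailable.
    def tokenize_without_punctuation(text):
        """Return the runs of word characters in text, dropping punctuation."""
        return re.findall(r"\w+", text)


my_text = "This is my text, this is a nice way to input text."
tokens = tokenize_without_punctuation(my_text)
print(tokens)
# ['This', 'is', 'my', 'text', 'this', 'is', 'a', 'nice', 'way', 'to', 'input', 'text']
```

Be aware that \w+ also splits contractions, so "didn't" becomes 'didn' and 't'; a pattern such as [\w']+ keeps the apostrophe inside the token. The resulting list can then be passed to nltk.Text(tokens) to get an object like the ones in nltk.book.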
Pavel Anossov answered Sep 25 '22 19:09