The NLTK book has a couple of examples of word counts, but in reality they are not word counts but token counts. For instance, Chapter 1, Counting Vocabulary says that the following gives a word count:
text = nltk.Text(tokens)
len(text)
However, it doesn't - it gives a word and punctuation count. How can you get a real word count (ignoring punctuation)?
Similarly, how can you get the average number of characters in a word? The obvious answer is:
word_average_length = len(string_of_text) / len(text)
However, this would be off, because len(string_of_text) is a character count that includes spaces and punctuation, and len(text) counts punctuation tokens as if they were words.
Am I missing something here? This must be a very common NLP task...
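To make the discrepancy concrete, here is a minimal sketch, assuming the standard nltk.word_tokenize and Python's str.isalpha (neither is part of the book's example):
import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # tokenizer models, needed once
raw = "Hello, world! This is a test."
text = nltk.Text(word_tokenize(raw))

print(len(text))                                # 9 tokens, punctuation included
words = [t for t in text if t.isalpha()]        # keep alphabetic tokens only
print(len(words))                               # 6 actual words
print(sum(len(w) for w in words) / len(words))  # average word length: 3.5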
Tokenization with nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It icludes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)
This returns:
['This', 'is', 'my', 'text', 'It', 'includes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
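Since the tokenizer has already dropped the punctuation, the word count and average word length from the original question follow directly (a small follow-up sketch using the tokens above, not part of the original answer):
word_count = len(tokens)                                    # 15 for the sentence above
average_length = sum(len(t) for t in tokens) / word_count   # 4.0
print(word_count, average_length)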
Use a regular expression to filter out the punctuation
import re
from collections import Counter
>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile('.*[A-Za-z0-9].*') # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})
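The word count itself is then just the length of the filtered list (not shown in the original answer, but it follows directly from it):
>>> len(filtered)
4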
Sum the lengths of each word. Divide by the number of words.
>>> float(sum(map(len, filtered))) / len(filtered)
3.75
Or you could make use of the counts you already computed to avoid re-counting. This multiplies each word's length by the number of times it was seen, then sums all of that up.
>>> float(sum(len(w) * c for w, c in counts.items())) / len(filtered)
3.75
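On Python 3, the same average can also be computed with the standard library's statistics module (an alternative sketch, not part of the original answer):
>>> from statistics import mean
>>> mean(map(len, filtered))
3.75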