real word count in NLTK

Tags: python, nlp, nltk

The NLTK book has a couple of examples of word counts, but in reality they are not word counts but token counts. For instance, Chapter 1 ("Counting Vocabulary") says that the following gives a word count:

import nltk

text = nltk.Text(tokens)
len(text)

However, it doesn't - it gives a word and punctuation count. How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word? The obvious answer is:

word_average_length = len(string_of_text) / len(text)

However, this would be off because:

  1. len(string_of_text) is a character count, including spaces
  2. len(text) is a token count, excluding spaces but including punctuation marks, which aren't words.

Am I missing something here? This must be a very common NLP task...

asked May 20 '12 by Zach


2 Answers

Tokenization with NLTK

from nltk.tokenize import RegexpTokenizer

# \w+ keeps runs of word characters, so punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It includes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)

Returns

['This', 'is', 'my', 'text', 'It', 'includes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
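
From those tokens, the word count and the average word length the question asks for follow directly. Note that \w+ also splits the abbreviation "U.S." into 'U' and 'S', as the output shows, which inflates the count slightly:

word_count = len(tokens)                                   # 15 for the sample text
average_word_length = sum(map(len, tokens)) / word_count   # 4.0 for the sample text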
answered Oct 21 '22 by petra


Removing Punctuation

Use a regular expression to filter out the punctuation:

>>> import re
>>> from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})
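
The real word count the question asks about is then just the length of the filtered list (or, equivalently, the sum of the counts):

>>> len(filtered)
4
>>> sum(counts.values())
4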

Average Number of Characters

Sum the lengths of each word. Divide by the number of words.

>>> sum(map(len, filtered)) / len(filtered)
3.75

Or you can reuse the counts you already computed to avoid re-scanning the list. This multiplies each word's length by the number of times it was seen, then sums everything up.

>>> sum(len(w) * c for w, c in counts.items()) / len(filtered)
3.75
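
If you would rather stay with NLTK's own tokenizer than a raw regex, the same filtering idea applies. A minimal sketch, assuming the punkt tokenizer data has been downloaded (newer NLTK releases may name the resource punkt_tab):

import nltk
nltk.download('punkt')  # tokenizer model, fetched once

text = "This is my text. It includes commas, question marks? and other stuff."
tokens = nltk.word_tokenize(text)
# keep tokens containing at least one letter or digit, dropping pure punctuation
words = [w for w in tokens if any(ch.isalnum() for ch in w)]
word_count = len(words)
average_word_length = sum(map(len, words)) / word_count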
answered Oct 21 '22 by dhg