The NLTK book has a couple of examples of word counts, but in reality they are not word counts but token counts. For instance, Chapter 1, Counting Vocabulary says that the following gives a word count:
text = nltk.Text(tokens)
len(text)
However, it doesn't - it gives a word and punctuation count. How can you get a real word count (ignoring punctuation)?
Similarly, how can you get the average number of characters in a word? The obvious answer is:
word_average_length = len(string_of_text) / len(text)
However, this would be off, because len(string_of_text) is a character count that includes spaces and punctuation, and len(text) counts punctuation tokens as if they were words.
Am I missing something here? This must be a very common NLP task...
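To make the discrepancy concrete, here is a minimal sketch, assuming the standard nltk.word_tokenize and Python's str.isalpha (neither is part of the book's example):
import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # tokenizer models, needed once
raw = "Hello, world! This is a test."
text = nltk.Text(word_tokenize(raw))

print(len(text))                                # 9 tokens, punctuation included
words = [t for t in text if t.isalpha()]        # keep alphabetic tokens only
print(len(words))                               # 6 actual words
print(sum(len(w) for w in words) / len(words))  # average word length: 3.5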
Tokenization with nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It icludes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)
This returns:
['This', 'is', 'my', 'text', 'It', 'includes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
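Since the tokenizer has already dropped the punctuation, the word count and average word length from the original question follow directly (a small follow-up sketch using the tokens above, not part of the original answer):
word_count = len(tokens)                                    # 15 for the sentence above
average_length = sum(len(t) for t in tokens) / word_count   # 4.0
print(word_count, average_length)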
Use a regular expression to filter out the punctuation
import re
from collections import Counter
>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile('.*[A-Za-z0-9].*') # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})
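The word count itself is then just the length of the filtered list (not shown in the original answer, but it follows directly from it):
>>> len(filtered)
4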
Sum the lengths of each word. Divide by the number of words.
>>> float(sum(map(len, filtered))) / len(filtered)
3.75
Or you could make use of the counts you already computed to avoid re-counting. This multiplies each word's length by the number of times it was seen, then sums all of that up.
>>> float(sum(len(w) * c for w, c in counts.items())) / len(filtered)
3.75
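On Python 3, the same average can also be computed with the standard library's statistics module (an alternative sketch, not part of the original answer):
>>> from statistics import mean
>>> mean(map(len, filtered))
3.75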