I am a Python newbie trying to understand the answer given here to the question of counting unique words in a document. The answer is:
print(len(set(w.lower() for w in open('filename.dat').read().split())))
It reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, creates a set (which holds each distinct word once) from the lower-case words, counts them and prints the result.
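The one-liner can be unpacked into separate steps. Here is a minimal sketch, assuming Python 3 and using a short sample string (the name `text` is made up for illustration) in place of the contents of 'filename.dat':

```python
# Sample text standing in for open('filename.dat').read() (an assumption for illustration).
text = "The cat and the dog and THE bird"

words = text.split()                      # split on whitespace
lower_words = [w.lower() for w in words]  # normalise case
unique_words = set(lower_words)           # a set keeps one copy of each word
unique_count = len(unique_words)          # len() works on a set

print(unique_count)  # prints 5
```

The last step is just `len()`: a set has a length even though it has no positions to index into.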
To try to understand that, I am trying to implement it in Python step by step. I can import the text file using open and read, divide it into individual words using split, and make them all lower case using lower. I can also create a set of the unique words in the list. However, I cannot figure out how to do the last part - count the number of unique words.
I thought I could finish by iterating through the items in the set of unique words and counting them in the original lower-case list, but I find that the set construct is not indexable.
So I guess I am trying to do something that in natural language is like, for all the items in the set, tell me how many times they occur in the lower case list. But I cannot quite figure out how to do that, and I suspect some underlying misunderstanding of Python is holding me back.
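A set cannot be indexed, but it can be iterated over directly with a for loop, which is enough for the approach described here. A sketch, assuming a small hand-written list `lower_words` standing in for the lower-cased words read from the file:

```python
# Hypothetical sample data standing in for the lower-cased word list.
lower_words = ["the", "cat", "and", "the", "dog", "and", "the"]

occurrences = {}
for word in set(lower_words):                    # iterate over the set; no index needed
    occurrences[word] = lower_words.count(word)  # list.count scans the whole list

for word in sorted(occurrences):
    print(word, occurrences[word])
```

This rescans the list once per distinct word, so it gets slow on large documents; collections.Counter produces the same counts in a single pass.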
Guys, thanks for the answers. I have just realised I did not explain myself correctly - I wanted to find not only the total number of unique words (which I understand is the length of the set) but also the number of times each individual word was used, e.g. 'the' was used 14 times, 'and' was used 9 times, 'it' was used 20 times and so on. Apologies for the confusion.
Python: The idea is to use a dictionary to hold the count of each word. First, though, the words have to be extracted from the string, because the string may contain punctuation marks; this can be done with a regular expression (regex). A word whose count in the dictionary is 1 occurs only once, i.e. it is a unique word.
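A minimal sketch of that dictionary-plus-regex idea. The sample text and the pattern `[a-z']+` are assumptions; the pattern is only one simple definition of a "word" and may need adjusting for hyphens, digits or non-English letters:

```python
import re

# Hypothetical sample text containing punctuation.
text = "The cat sat. The dog, the bird and the cat!"

# Lower-case first, then pull out runs of letters/apostrophes (assumed word definition).
words = re.findall(r"[a-z']+", text.lower())

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # start at 0 the first time a word is seen

# Words that occur exactly once
once_only = sorted(w for w, n in counts.items() if n == 1)

print(counts)
print(once_only)
```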
Python String count(): The count() method returns the number of occurrences of a substring in the given string.
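Note that str.count() matches substrings, not whole words, so it can over-count when used for word frequencies. A small illustration with a made-up sentence:

```python
# Hypothetical sample sentence.
text = "the other day the weather was fine"

substring_hits = text.count("the")      # also matches inside "other" and "weather"
word_hits = text.split().count("the")   # list.count compares whole words only

print(substring_hits, word_hits)  # prints 4 2
```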
I believe that Counter is all that you need in this case:
from collections import Counter
print(Counter(yourtext.split()))
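A slightly fuller sketch of the same idea, assuming Python 3 and a short sample string for `yourtext`. It lower-cases the words first and shows how to read individual counts back out:

```python
from collections import Counter

# Hypothetical sample text.
yourtext = "The cat and the dog and the bird"

word_counts = Counter(w.lower() for w in yourtext.split())

print(len(word_counts))            # number of distinct words: 5
print(word_counts["the"])          # how many times 'the' was used: 3
print(word_counts.most_common(2))  # [('the', 3), ('and', 2)]
```

A Counter is a dict subclass, so the per-word numbers asked about in the question ('the' was used N times, and so on) are just lookups by key.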
I suppose this can be used to get a unique word count. It works fine with Python 3.10.2.
from collections import Counter

def get_count_of_unique_words(lines):
    # Despite its name, `lines` is iterated as a sequence of words.
    # Keep only the purely alphabetic ones.
    selected_words = []
    for word in lines:
        if word.isalpha():
            selected_words.append(word)
    # Count the words that occur exactly once.
    unique_count = 0
    for word, count in Counter(selected_words).items():
        if count == 1:
            unique_count += 1
    print(unique_count)
    return unique_count
Docs https://docs.python.org/3/library/collections.html#collections.Counter
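One thing to keep in mind: "unique" here means words that occur exactly once, which is different from the number of distinct words given by the length of the set. A short sketch of the difference, using a made-up word list:

```python
from collections import Counter

# Hypothetical sample data.
words = ["the", "cat", "and", "the", "dog", "and", "the"]

counts = Counter(words)
distinct = len(counts)                                  # the, cat, and, dog
used_once = sum(1 for n in counts.values() if n == 1)   # only cat and dog

print(distinct, used_once)  # prints 4 2
```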