
Counting the number of unique words in a document with Python

Tags: python

I am a Python newbie trying to understand the answer given here to the question of counting unique words in a document. The answer is:

print(len(set(w.lower() for w in open('filename.dat').read().split())))

It reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, creates a set of the unique lowercase words, counts them, and prints the result.
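As a minimal sketch, the steps in that description can be written out one at a time. The sample string below stands in for the file contents and is an assumption for illustration; with a real file, `text` would come from `open('filename.dat').read()`.

```python
# Sample text standing in for the contents of 'filename.dat' (assumed for illustration).
text = "The cat and the dog and THE bird"

words = text.split()                   # split on any whitespace
lowered = [w.lower() for w in words]   # normalise every word to lower case
unique_words = set(lowered)            # duplicates collapse into one entry each
unique_count = len(unique_words)       # len() works on a set just as on a list

print(unique_count)
```

The last step is just `len()`: a set is not indexable, but it still knows how many items it holds.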

To try to understand that, I am trying to implement it in Python step by step. I can import the text file using open and read, divide it into individual words using split, and make them all lower case using lower. I can also create a set of the unique words in the list. However, I cannot figure out how to do the last part: count the number of unique words.

I thought I could finish by iterating through the items in the set of unique words and counting them in the original lower-case list, but I find that the set construct is not indexable.

So I guess I am trying to do something that in natural language is like, for all the items in the set, tell me how many times they occur in the lower case list. But I cannot quite figure out how to do that, and I suspect some underlying misunderstanding of Python is holding me back.
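One way to express that idea directly, as a rough sketch: a set cannot be indexed, but it can be looped over with `for`, and `list.count` reports how often a value occurs in a list. The word list here is made up for illustration.

```python
# Hypothetical lower-cased word list, standing in for the words read from the file.
lowered = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]
unique_words = set(lowered)

# Iterate over the set directly (no indexing needed) and count each word in the list.
counts = {}
for word in unique_words:
    counts[word] = lowered.count(word)

# Sets are unordered, so sort the words for a stable printout.
for word in sorted(counts):
    print(word, counts[word])
```

Note that `lowered.count(word)` rescans the whole list for every unique word, so this can be slow on large documents.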

EDIT:

Guys, thanks for the answers. I have just realised I did not explain myself correctly. I wanted to find not only the total number of unique words (which I understand is the length of the set) but also the number of times each individual word was used, e.g. 'the' was used 14 times, 'and' was used 9 times, 'it' was used 20 times, and so on. Apologies for the confusion.

Stephen Flanagan asked Jun 06 '11 17:06




2 Answers

I believe that Counter is all that you need in this case:

from collections import Counter

# yourtext is the string holding the document's contents
print(Counter(yourtext.split()))
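A short usage sketch, assuming a made-up sample string for `yourtext` and lower-casing the words as in the question: a `Counter` gives both the number of distinct words (its length) and the count for each word.

```python
from collections import Counter

# Sample text (assumed for illustration); in practice this would be the file contents.
yourtext = "The cat and the dog and THE bird"

# Lower-case each word before counting so 'The' and 'THE' are tallied with 'the'.
word_counts = Counter(w.lower() for w in yourtext.split())

print(len(word_counts))            # number of distinct words
print(word_counts["the"])          # how many times one particular word occurs
print(word_counts.most_common(2))  # the two most frequent words with their counts
```

Looking up a word that never occurred returns 0 rather than raising `KeyError`.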
Artsiom Rudzenka answered Oct 13 '22 10:10



I suppose this can be used to get a unique word count, where a unique word is one that appears exactly once. Works fine with Python 3.10.2.

from collections import Counter

def get_count_of_unique_words(words):
    # Keep only purely alphabetic tokens (anything with digits or punctuation is skipped)
    selected_words = []
    for word in words:
        if word.isalpha():
            selected_words.append(word)

    # Count the words that occur exactly once
    unique_count = 0
    for word, count in Counter(selected_words).items():
        if count == 1:
            unique_count += 1

    print(unique_count)
    return unique_count

Docs: https://docs.python.org/3/library/collections.html#collections.Counter

Michal answered Oct 13 '22 09:10
