Counting the number of unique words in a document with Python

Tags:

python

I am a Python newbie trying to understand the answer given here to the question of counting unique words in a document. The answer is:

print(len(set(w.lower() for w in open('filename.dat').read().split())))

Reads the entire file into memory, splits it into words using whitespace, converts each word to lower case, creates a (unique) set from the lowercase words, counts them and prints the output
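The description above can be unpacked one step at a time. This is only a minimal sketch: it assumes the text is already in a string (the sample `text` is made up for illustration; in the one-liner it comes from `open('filename.dat').read()`), and the helper name `count_unique_words` is hypothetical.

```python
def count_unique_words(text):
    """Count distinct words in text, ignoring case (hypothetical helper)."""
    words = text.split()                    # split on any whitespace
    lowered = [w.lower() for w in words]    # normalise case
    unique = set(lowered)                   # a set keeps one copy of each word
    return len(unique)                      # len() of the set is the unique-word count


# Made-up sample text standing in for the file contents.
text = "The cat and the hat and THE bat"
print(count_unique_words(text))  # prints 5
```

The last step is simply `len()` applied to the set; no indexing into the set is needed.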

To try to understand that, I am trying to implement it in Python step by step. I can import the text file using open and read, divide it into individual words using split, and make them all lower case using lower. I can also create a set of the unique words in the list. However, I cannot figure out how to do the last part - count the number of unique words.

I thought I could finish by iterating through the items in the set of unique words and counting them in the original lower-case list, but I find that the set construct is not indexable.

So I guess I am trying to do something that in natural language would be: for all the items in the set, tell me how many times they occur in the lower-case list. But I cannot quite figure out how to do that, and I suspect some underlying misunderstanding of Python is holding me back.
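A set cannot be indexed, but it can be iterated over with a for loop, so the idea described above can be sketched roughly as follows. The sample word list and the function name `occurrences_per_word` are assumptions made for illustration.

```python
def occurrences_per_word(lowered_words):
    """Map each distinct word to how often it occurs in lowered_words."""
    counts = {}
    for word in set(lowered_words):               # sets are iterable, just not indexable
        counts[word] = lowered_words.count(word)  # occurrences in the original list
    return counts


# Made-up sample list of already lower-cased words.
lowered_words = "the cat and the hat and the bat".split()
counts = occurrences_per_word(lowered_words)
print(counts["the"])  # prints 3
print(len(counts))    # prints 5
```

Note that `list.count` rescans the whole list for every distinct word, which is fine for a small document but slow for a large one.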

EDIT -

Guys, thanks for the answers. I have just realised I did not explain myself correctly - I wanted to find not only the total number of unique words (which I understand is the length of the set) but also the number of times each individual word was used, e.g. 'the' was used 14 times, 'and' was used 9 times, 'it' was used 20 times and so on. Apologies for the confusion.

asked Jun 06 '11 17:06

Stephen Flanagan

2 Answers

I believe that Counter is all that you need in this case:

from collections import Counter

print(Counter(yourtext.split()))
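As a slightly fuller sketch of the same idea, the snippet below lower-cases the words first (as in the question) and uses `Counter.most_common` to list the words by frequency. The sample string assigned to `yourtext` is made up for illustration.

```python
from collections import Counter

# Made-up sample text; in practice this would be the contents of the file.
yourtext = "It was the best of times it was the worst of times"

# Counter accepts any iterable of hashable items and tallies them in one pass.
word_counts = Counter(w.lower() for w in yourtext.split())

print(len(word_counts))            # number of distinct words
print(word_counts["it"])           # occurrences of one particular word
print(word_counts.most_common(2))  # the two most frequent words with their counts
```

The length of the Counter gives the total number of unique words, and looking up a word gives the number of times it was used, which covers both parts of the question.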

answered Oct 13 '22 10:10

Artsiom Rudzenka

I suppose this can be used to get a unique word count (that is, the number of words that occur exactly once). Works fine with Python 3.10.2.

from collections import Counter

def get_count_of_unique_words(lines):
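    # `lines` is expected to be an iterable of words; keep only the purely alphabetic ones.
    selected_words = []
    for word in lines:
        if word.isalpha():
            selected_words.append(word)

    # Count the words that occur exactly once.
    unique_count = 0
    for word, count in Counter(selected_words).items():
        if count == 1:
            unique_count += 1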

    print(unique_count)
    return unique_count

Docs https://docs.python.org/3/library/collections.html#collections.Counter
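To compare this with the distinct-word count asked about in the question, here is a small sketch on a made-up word list. The names `distinct_words` and `words_used_once` are hypothetical.

```python
from collections import Counter

# Made-up sample list of words.
words = ["the", "cat", "and", "the", "hat"]
counts = Counter(words)

# Every different word, however often it appears.
distinct_words = len(counts)
# Only the words that appear exactly once.
words_used_once = sum(1 for c in counts.values() if c == 1)

print(distinct_words, words_used_once)  # prints: 4 3
```

Here 'the' appears twice, so it counts as a distinct word but not as a word used once.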

answered Oct 13 '22 09:10

Michal
