How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

print(cv.vocabulary_)
# {'bird': 0, 'cat': 1, 'dog': 2, 'fish': 3}

I was expecting it to return {'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}, i.e. the number of times each word occurs in the corpus.

asked Dec 15 '14 by Adrien

People also ask

How do you use scikit-learn's CountVectorizer?

You can use it as follows: create an instance of the CountVectorizer class, call fit() to learn a vocabulary from one or more documents, then call transform() on one or more documents as needed to encode each as a count vector (fit_transform() combines the two steps).
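As a minimal sketch of that fit/transform split (the two toy documents are made up for illustration, and get_feature_names_out() assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog ran"]  # toy corpus, for illustration only
cv = CountVectorizer()
cv.fit(docs)              # learn the vocabulary from the corpus
X = cv.transform(docs)    # encode each document as a vector of counts
print(cv.get_feature_names_out())
print(X.toarray())
# ['cat' 'dog' 'ran' 'sat' 'the']
# [[1 0 0 1 1]
#  [0 1 1 0 1]]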

Is CountVectorizer the same as bag of words?

CountVectorizer creates a matrix of documents by token counts (a bag of terms/tokens), which is why the result is also known as a document-term matrix (DTM).
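As a sketch of what that looks like, the matrix from the question's corpus can be wrapped in a pandas DataFrame to make the document-term structure visible (assuming scikit-learn >= 1.0 for get_feature_names_out()):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
dtm = pd.DataFrame(cv.fit_transform(texts).toarray(),
                   columns=cv.get_feature_names_out())
print(dtm)
#    bird  cat  dog  fish
# 0     0    1    1     1
# 1     0    2    1     0
# 2     1    0    0     1
# 3     1    0    0     0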

What is ngram_range in CountVectorizer?

CountVectorizer tokenizes the data and splits it into chunks called n-grams, whose length we can define by passing a tuple to the ngram_range argument. For example, (1, 1) would give us unigrams or 1-grams such as “whey” and “protein”, while (2, 2) would give us bigrams or 2-grams, such as “whey protein”.
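A small sketch of the difference, on a made-up one-document corpus (get_feature_names_out() again assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey protein shake"]  # toy example
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)
print(unigrams.get_feature_names_out())  # ['protein' 'shake' 'whey']
print(bigrams.get_feature_names_out())   # ['protein shake' 'whey protein']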


3 Answers

cv.vocabulary_ in this instance is a dict whose keys are the words (features) that were found and whose values are those words' column indices in the feature matrix, which is why they're 0, 1, 2, 3. It's just bad luck that it looked similar to your counts :)

You need to work with the cv_fit object to get the counts:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# get_feature_names_out() requires scikit-learn >= 1.0;
# on older versions use cv.get_feature_names() instead
print(cv.get_feature_names_out())
print(cv_fit.toarray())
# ['bird' 'cat' 'dog' 'fish']
# [[0 1 1 1]
#  [0 2 1 0]
#  [1 0 0 1]
#  [1 0 0 0]]

Each row in the array is one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. You can see that if you sum over each column you get the counts you expected:

print(cv_fit.toarray().sum(axis=0))
# [2 3 2 2]

Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.
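For instance, a rough sketch of the same count with collections.Counter (note that its whitespace split() is cruder than CountVectorizer's tokenizer, which also lowercases and drops punctuation):

from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
counts = Counter(word for text in texts for word in text.split())
print(counts)
# Counter({'cat': 3, 'dog': 2, 'fish': 2, 'bird': 2})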

answered Oct 18 '22 by Ffisegydd


cv_fit.toarray().sum(axis=0) definitely gives the correct result, but it is much faster to perform the sum on the sparse matrix directly and only then convert the result to an array:

import numpy as np

np.asarray(cv_fit.sum(axis=0))
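One caveat worth adding (my note, not part of the original answer): the sparse sum returns a 1×n row matrix rather than a 1-D array, so flattening it with ravel() gives the same shape as the dense version:

counts = np.asarray(cv_fit.sum(axis=0)).ravel()
print(counts)
# [2 3 2 2]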
answered Oct 18 '22 by pieterbons


We can use zip() to build a dict from the list of words and the list of their counts:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
# .tolist() converts the numpy types to plain Python lists/ints;
# on scikit-learn < 1.0, use cv.get_feature_names() instead
word_list = cv.get_feature_names_out().tolist()
count_list = cv_fit.toarray().sum(axis=0).tolist()

The output is as follows:

>>> print(word_list)
['bird', 'cat', 'dog', 'fish']
>>> print(count_list)
[2, 3, 2, 2]
>>> print(dict(zip(word_list, count_list)))
{'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}
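As a follow-up sketch (not part of the original answer), the same dict can be sorted by frequency with sorted():

freq = dict(zip(word_list, count_list))
for word, count in sorted(freq.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)
# cat 3
# bird 2
# dog 2
# fish 2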
answered Oct 18 '22 by YASH GUPTA