How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

print(cv.vocabulary_)
# {'bird': 0, 'cat': 1, 'dog': 2, 'fish': 3}

I was expecting it to return {'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}, i.e. the number of times each word occurs in the corpus.

asked Dec 15 '14 by Adrien

People also ask

How do you use scikit-learn's CountVectorizer?

You can use it as follows: create an instance of the CountVectorizer class, call fit() to learn a vocabulary from one or more documents, then call transform() on one or more documents as needed to encode each as a count vector (fit_transform() combines the two steps).
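As a minimal sketch of that fit/transform split (the two toy documents are made up for illustration, and get_feature_names_out() assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog ran"]  # toy corpus, for illustration only
cv = CountVectorizer()
cv.fit(docs)              # learn the vocabulary from the corpus
X = cv.transform(docs)    # encode each document as a vector of counts
print(cv.get_feature_names_out())
print(X.toarray())
# ['cat' 'dog' 'ran' 'sat' 'the']
# [[1 0 0 1 1]
#  [0 1 1 0 1]]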

Is CountVectorizer the same as bag of words?

CountVectorizer creates a matrix of documents by token counts (a bag of terms/tokens), which is why the result is also known as a document-term matrix (DTM).
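As a sketch of what that looks like, the matrix from the question's corpus can be wrapped in a pandas DataFrame to make the document-term structure visible (assuming scikit-learn >= 1.0 for get_feature_names_out()):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
dtm = pd.DataFrame(cv.fit_transform(texts).toarray(),
                   columns=cv.get_feature_names_out())
print(dtm)
#    bird  cat  dog  fish
# 0     0    1    1     1
# 1     0    2    1     0
# 2     1    0    0     1
# 3     1    0    0     0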

What is ngram_range in CountVectorizer?

CountVectorizer tokenizes the data and splits it into chunks called n-grams, whose length we can define by passing a tuple to the ngram_range argument. For example, (1, 1) would give us unigrams or 1-grams such as “whey” and “protein”, while (2, 2) would give us bigrams or 2-grams, such as “whey protein”.
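A small sketch of the difference, on a made-up one-document corpus (get_feature_names_out() again assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey protein shake"]  # toy example
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)
print(unigrams.get_feature_names_out())  # ['protein' 'shake' 'whey']
print(bigrams.get_feature_names_out())   # ['protein shake' 'whey protein']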


3 Answers

cv.vocabulary_ in this instance is a dict whose keys are the words (features) that were found and whose values are those words' column indices in the feature matrix, which is why they're 0, 1, 2, 3. It's just bad luck that it looked similar to your counts :)

You need to work with the cv_fit object to get the counts:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# get_feature_names_out() requires scikit-learn >= 1.0;
# on older versions use cv.get_feature_names() instead
print(cv.get_feature_names_out())
print(cv_fit.toarray())
# ['bird' 'cat' 'dog' 'fish']
# [[0 1 1 1]
#  [0 2 1 0]
#  [1 0 0 1]
#  [1 0 0 0]]

Each row in the array is one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. You can see that if you sum over each column you get the counts you expected:

print(cv_fit.toarray().sum(axis=0))
# [2 3 2 2]

Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.
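For instance, a rough sketch of the same count with collections.Counter (note that its whitespace split() is cruder than CountVectorizer's tokenizer, which also lowercases and drops punctuation):

from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
counts = Counter(word for text in texts for word in text.split())
print(counts)
# Counter({'cat': 3, 'dog': 2, 'fish': 2, 'bird': 2})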

answered Oct 18 '22 by Ffisegydd


cv_fit.toarray().sum(axis=0) definitely gives the correct result, but it is much faster to perform the sum on the sparse matrix directly and only then convert the result to an array:

import numpy as np

np.asarray(cv_fit.sum(axis=0))
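One caveat worth adding (my note, not part of the original answer): the sparse sum returns a 1×n row matrix rather than a 1-D array, so flattening it with ravel() gives the same shape as the dense version:

counts = np.asarray(cv_fit.sum(axis=0)).ravel()
print(counts)
# [2 3 2 2]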
answered Oct 18 '22 by pieterbons


We can use zip() to build a dict from the list of words and the list of their counts:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
# .tolist() converts the numpy types to plain Python lists/ints;
# on scikit-learn < 1.0, use cv.get_feature_names() instead
word_list = cv.get_feature_names_out().tolist()
count_list = cv_fit.toarray().sum(axis=0).tolist()

The output is as follows:

>>> print(word_list)
['bird', 'cat', 'dog', 'fish']
>>> print(count_list)
[2, 3, 2, 2]
>>> print(dict(zip(word_list, count_list)))
{'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}
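As a follow-up sketch (not part of the original answer), the same dict can be sorted by frequency with sorted():

freq = dict(zip(word_list, count_list))
for word, count in sorted(freq.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)
# cat 3
# bird 2
# dog 2
# fish 2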
answered Oct 18 '22 by YASH GUPTA