List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop-words. For example:

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there any built-in function for this?

user1506145 asked Apr 18 '13 08:04



1 Answer

If cv is your CountVectorizer and X is the vectorized corpus, then

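import numpy as np

list(zip(cv.get_feature_names_out(),
         np.asarray(X.sum(axis=0)).ravel()))

returns a list of (term, frequency) pairs, one for each distinct term that the CountVectorizer extracted from the corpus. (On scikit-learn releases older than 1.0, call cv.get_feature_names() instead; it was later removed in favor of get_feature_names_out(). In Python 3, zip is lazy, hence the list(...) wrapper.)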

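(The little asarray + ravel dance is needed because summing a scipy.sparse matrix along an axis returns a 1 x n numpy.matrix rather than a flat array; asarray + ravel turns it into a plain 1-D array of counts.)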

Fred Foo answered Oct 15 '22 17:10