I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example:
'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on
Is there any built-in function for this?
CountVectorizer is a tool in the scikit-learn library for Python. It converts a collection of text documents into a matrix of token counts: each row is a document, each column is a term from the vocabulary, and each entry is the number of times that term occurs in that document. This bag-of-words representation is also known as a document-term matrix (DTM), and it is returned as a sparse matrix.
The fitted vocabulary_ attribute is a dict whose keys are terms and whose values are the corresponding column indices in the feature matrix.
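To make the relationship between vocabulary_ and the count matrix concrete, here is a minimal sketch. The two-sentence corpus is a made-up example, not data from the question, and the default tokenizer settings are assumed (they drop single-character tokens and lowercase the text).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat", "the cat and the dog"]

cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse document-term matrix, one row per document

# vocabulary_ maps each term to its column index in X.
vocab = {term: int(idx) for term, idx in cv.vocabulary_.items()}

print(X.shape)      # (number of documents, number of distinct terms)
print(vocab)        # term -> column index
print(X.toarray())  # dense view; only sensible for small corpora
```

Column indices follow the alphabetical order of the terms, so looking up a term in vocabulary_ tells you which column of X holds its counts.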
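If cv is your fitted CountVectorizer and X is the vectorized corpus, then

    zip(cv.get_feature_names_out(), np.asarray(X.sum(axis=0)).ravel())

yields a (term, frequency) pair for each distinct term in the corpus that the CountVectorizer extracted; wrap it in list() if you need a list. On scikit-learn releases before 1.0, call cv.get_feature_names() instead; that method was removed in 1.2.
(The little asarray + ravel dance is needed because summing a scipy.sparse matrix along an axis returns a two-dimensional numpy.matrix rather than a flat array.)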