I'm following the example in the scikit-learn docs where CountVectorizer
is used on some dataset.
Question: count_vect.vocabulary_.viewitems()
lists all the terms and their frequencies. How do you sort them by the number of occurrences?
sorted( count_vect.vocabulary_.viewitems() )
does not seem to work.
vocabulary_.viewitems()
does not in fact list the terms and their frequencies; it's a mapping from terms to their column indexes. The per-document frequencies are returned by the fit_transform method, which gives back a sparse matrix (COO in older scikit-learn releases, CSR in recent ones) where the rows are documents and the columns are words, with column indexes mapped to words via vocabulary_. Summing each column gives the total frequency of each term across all documents, for example:
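import numpy as np
matrix = count_vect.fit_transform(doc_list)
# column sums give corpus-wide counts; flatten the 1 x n_terms result to 1-D
totals = np.asarray(matrix.sum(axis=0)).ravel()
# on scikit-learn older than 1.0, use get_feature_names() instead
freqs = zip(count_vect.get_feature_names_out(), totals)
# sort from largest to smallest
print(sorted(freqs, key=lambda x: -x[1]))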
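To make the difference between the term-to-index mapping and the actual counts concrete, here is a minimal pure-Python sketch of the same idea. It does not use scikit-learn, and it is not how CountVectorizer is implemented: the toy corpus `docs`, the whitespace tokenization, and the dense list-of-lists matrix are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus standing in for doc_list
docs = ["the cat sat", "the cat ate the fish", "a fish swam"]

# Naive whitespace tokenization (an assumption; CountVectorizer's default
# tokenizer behaves differently, e.g. it drops single-character tokens)
tokenized = [doc.split() for doc in docs]

# Term -> column index in alphabetical order, playing the role of vocabulary_
all_terms = sorted({term for tokens in tokenized for term in tokens})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

# Dense document-term count matrix: one row per document, one column per term
matrix = [[0] * len(vocabulary) for _ in docs]
for row, tokens in zip(matrix, tokenized):
    for term, count in Counter(tokens).items():
        row[vocabulary[term]] = count

# Column sums give the total count of each term across the whole corpus
totals = [sum(column) for column in zip(*matrix)]

# Pair each column's term with its total, then sort from largest to smallest
terms_by_column = sorted(vocabulary, key=vocabulary.get)
freqs = sorted(zip(terms_by_column, totals), key=lambda pair: -pair[1])
print(freqs)
```

Note that `vocabulary["the"]` here is a column position, not a count, which is why sorting the mapping's items directly does not order terms by frequency.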