List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see every term together with its frequency in the text corpus, so that I can select stop words. For example:

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there any built-in function for this?

asked Apr 18 '13 08:04

user1506145

1 Answer

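If cv is your fitted CountVectorizer and X is the vectorized corpus (the matrix returned by cv.fit_transform or cv.transform), then

import numpy as np

list(zip(cv.get_feature_names_out(),
         np.asarray(X.sum(axis=0)).ravel()))

returns a list of (term, frequency) pairs, one for each distinct term that the CountVectorizer extracted from the corpus. On scikit-learn releases older than 1.0, call cv.get_feature_names() instead; that method was removed in 1.2.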

(The little asarray + ravel dance is needed because X.sum(axis=0) on a scipy.sparse matrix returns a 1 x n numpy.matrix rather than a flat one-dimensional array.)
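
To list the terms in order of occurrence, as the question asks, the pairs can be sorted by count. The following is a minimal sketch rather than a built-in scikit-learn feature: the toy corpus and the helper name term_frequencies are invented for illustration, and the try/except only exists so the sketch also runs on scikit-learn releases that predate get_feature_names_out.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog",
]

cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse document-term count matrix


def term_frequencies(vectorizer, matrix):
    """Hypothetical helper: (term, count) pairs, most frequent first."""
    try:
        terms = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0
    except AttributeError:
        terms = vectorizer.get_feature_names()      # older releases
    # Summing over rows gives one total per vocabulary term; asarray + ravel
    # flattens the 1 x n result into a plain one-dimensional array.
    counts = np.asarray(matrix.sum(axis=0)).ravel()
    # A stable sort on the negated counts keeps ties in alphabetical order.
    order = np.argsort(-counts, kind="stable")
    return [(str(terms[i]), int(counts[i])) for i in order]


freqs = term_frequencies(cv, X)
for term, count in freqs:
    print(term, count)
```

The terms at the top of this listing are the natural stop-word candidates. Note that the default tokenizer of CountVectorizer drops single-character tokens, so they never appear in the counts.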

answered Oct 15 '22 17:10

Fred Foo

Tags: python, machine-learning, text-extraction, scikit-learn, countvectorizer
