
Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?

I have been working with the CountVectorizer class in scikit-learn.

I understand that if used in the manner shown below, the final output will consist of an array containing counts of features, or tokens.

These tokens are extracted from a set of keywords, i.e.

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

The next step is:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(tokenizer=tokenize)  # tokenize is a custom tokenizer defined in the referenced blog post
data = vec.fit_transform(tags).toarray()
print(data)

Where we get

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]

This is fine, but my situation is just a little bit different.

I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.

In other words, how can I get counts of another set of documents, say,

list_of_new_documents = [
    ["python, chicken"],
    ["linux, cow, ubuntu"],
    ["machine learning, bird, fish, pig"],
]

And get:

[[0 0 0 1 0 0]
 [0 1 0 0 0 1]
 [0 0 0 0 0 0]]

I have read the documentation for the CountVectorizer class, and came across the vocabulary argument, which is a mapping of terms to feature indices. I can't seem to get this argument to help me, however.

Any advice is appreciated.
PS: all credit due to Matthias Friedrich's Blog for the example I used above.

asked Apr 07 '14 by tumultous_rooster


People also ask

Is CountVectorizer feature extraction?

This process is called feature extraction (or vectorization). Scikit-learn's CountVectorizer is used to convert a collection of text documents to a vector of term/token counts. It also enables the ​pre-processing of text data prior to generating the vector representation.

What is CountVectorizer used for?

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

What does CountVectorizer do in NLP?

CountVectorizer breaks a sentence or other text down into words, performing preprocessing steps such as lowercasing all words and removing special characters. In NLP, models can't understand textual data; they only accept numbers, so the text needs to be vectorized.

Can we fit a new CountVectorizer on the test set and use it instead of Vectorizer we fitted on the train set?

CountVectorizer is only a representation (encoding) of the text; you need a classification algorithm (for example, a decision tree) to train a model on it. The model cannot and should not use anything from the test set: fitting a new vectorizer on the test set would be data leakage, and the evaluation would be wrong.


1 Answer

You're right that vocabulary is what you want. It works like this:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold',
...                   'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

So you pass it your desired features, either as an iterable of terms (as above) or as a dict mapping each term to a column index.
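For instance, the dict form works like this (the indices must cover the full range 0 to n-1 with no gaps):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Keys are the terms; values are the column indices they should occupy.
cv = CountVectorizer(vocabulary={'hot': 0, 'cold': 1, 'old': 2})
X = cv.fit_transform(['nine days old', 'pease porridge hot'])
print(X.toarray())
# [[0 0 1]
#  [1 0 0]]
```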

If you used CountVectorizer on one set of documents and want to reuse the features from those documents for a new set, take the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do

newVec = CountVectorizer(vocabulary=vec.vocabulary_) 

to create a new vectorizer that uses the vocabulary from your first one.
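Putting it together with the question's data (again assuming a comma-splitting tokenize, since the original isn't shown), this reproduces the desired output:

```python
from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):  # assumed: split the keyword strings on commas
    return [t.strip() for t in text.split(",")]

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
new_documents = [  # flattened: transform() expects strings, not nested lists
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

vec = CountVectorizer(tokenizer=tokenize)
vec.fit(tags)  # learn the feature set from the original tags

new_vec = CountVectorizer(tokenizer=tokenize, vocabulary=vec.vocabulary_)
result = new_vec.fit_transform(new_documents).toarray()
print(result)
# [[0 0 0 1 0 0]
#  [0 1 0 0 0 1]
#  [0 0 0 0 0 0]]
```

Note that calling vec.transform(new_documents) on the original fitted vectorizer gives the same result without constructing a second one.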

answered Oct 13 '22 by BrenBarn