 

CountVectorizer returns only zeros


I am trying to extract some features from a given document, given a pre-defined set of features.

from sklearn.feature_extraction.text import CountVectorizer
features = ['a', 'b', 'c']
doc = ['a', 'c']

vectoriser = CountVectorizer()
vectoriser.vocabulary = features
vectoriser.fit_transform(doc)

However, the output is a 2x3 array filled with zeros instead of:

desired_output = [[1, 0, 0],
                  [0, 0, 1]]

Any help would be much appreciated.

asked Mar 06 '17 by Immortalz


People also ask

What does CountVectorizer do in NLP?

CountVectorizer breaks a sentence or any other text down into words, applying preprocessing steps such as lowercasing and removing special characters along the way, and then counts how often each word occurs.

What is CountVectorizer in Sklearn?

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

What is Ngram_range in CountVectorizer?

CountVectorizer will tokenize the data and split it into chunks called n-grams, whose length we can define by passing a tuple to the ngram_range argument. For example, (1, 1) would give us unigrams or 1-grams such as “whey” and “protein”, while (2, 2) would give us bigrams or 2-grams, such as “whey protein”.

What is Max features in CountVectorizer?

The CountVectorizer will select the words/features/terms that occur most frequently. It uses absolute counts, so if you set max_features=3, it will select the 3 most common words in the data. Setting binary=True makes the CountVectorizer stop counting term frequency and only record whether a term is present in a document.


1 Answer

This is because the default token pattern in CountVectorizer discards any token that is only one character long. You can change the token pattern (and pass the vocabulary through the constructor) to fix this:

from sklearn.feature_extraction.text import CountVectorizer
features = ['a', 'b', 'c']
doc = ['a', 'c']

vectoriser = CountVectorizer(vocabulary=features, token_pattern=r"\b\w+\b")

vectoriser.fit_transform(doc)
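Putting the answer's fix together and printing the dense matrix shows it produces the desired output, with columns in the order of the supplied vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

features = ['a', 'b', 'c']
doc = ['a', 'c']  # two one-word documents

# \b\w+\b matches single-character tokens, unlike the default pattern
vectoriser = CountVectorizer(vocabulary=features, token_pattern=r"\b\w+\b")
X = vectoriser.fit_transform(doc)
print(X.toarray())
# [[1 0 0]
#  [0 0 1]]
```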
answered Sep 25 '22 by Kewl