I am trying to extract features from a document, given a pre-defined set of features.
from sklearn.feature_extraction.text import CountVectorizer
features = ['a', 'b', 'c']
doc = ['a', 'c']
vectoriser = CountVectorizer()
vectoriser.vocabulary = features
vectoriser.fit_transform(doc)
However, the output is a 2x3 array filled with zeros instead of:
desired_output = [[1, 0, 0],
                  [0, 0, 1]]
Any help would be much appreciated
What is CountVectorizer in NLP? CountVectorizer breaks a sentence or any text down into words, applying preprocessing steps along the way, such as converting all words to lowercase and removing special characters.
CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the entire text.
CountVectorizer will tokenize the data and split it into chunks called n-grams, whose length we can define by passing a tuple to the ngram_range argument. For example, (1, 1) would give us unigrams (1-grams) such as "whey" and "protein", while (2, 2) would give us bigrams (2-grams) such as "whey protein".
With max_features set, CountVectorizer keeps only the terms that occur most frequently, counted over the whole corpus: setting max_features=3 keeps the 3 most common words in the data. By setting binary=True, CountVectorizer no longer records the frequency of each term, only its presence (1) or absence (0).
This is because the default token pattern in CountVectorizer, r"(?u)\b\w\w+\b", only matches tokens that are at least two characters long, so single-character words like 'a' and 'c' are discarded. You can change the token pattern to fix this:
from sklearn.feature_extraction.text import CountVectorizer
features = ['a', 'b', 'c']
doc = ['a', 'c']
# \b\w+\b also matches single-character tokens, unlike the default pattern
vectoriser = CountVectorizer(vocabulary=features, token_pattern=r"\b\w+\b")
print(vectoriser.fit_transform(doc).toarray())
# [[1 0 0]
#  [0 0 1]]