I'm trying to understand the following code:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.','This is the second second document.','And the third one.','Is this the first document?']
X = vectorizer.fit_transform(corpus)
When I print X to see what is returned, I get this result:
(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 2
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
However, I don't understand what this result means.
In scikit-learn, fit_transform() is called on the training data so that the estimator can learn its parameters from that data and transform it in one step. For a scaler, the learned parameters are the mean and variance of each feature of the training set, and these learned parameters are then used to scale the test data. For CountVectorizer, as in the question, nothing is scaled: the learned parameter is the vocabulary, and the output is a matrix of word counts.
For a scaler, the fit(data) method computes the mean and standard deviation of each feature, and the transform(data) method scales the data using the values computed by fit(). For CountVectorizer, fit(data) builds the vocabulary and transform(data) counts how often each vocabulary word occurs in each document. The fit_transform() method does both the fit and the transform.
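The scaling behaviour described above can be sketched as follows. This is only an illustration: StandardScaler is assumed as the scaler (the text does not name one), and the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data, purely for illustration
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
# fit_transform: learn mean and std from the training data, then scale it
X_train_scaled = scaler.fit_transform(X_train)
# transform: scale the test data with the mean and std learned above
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)    # mean learned from X_train
print(scaler.scale_)   # standard deviation learned from X_train
print(X_test_scaled)
```

Note that the test data is never passed to fit(), so it has no influence on the learned parameters.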
By default, CountVectorizer keeps every word/feature/term it finds in the corpus. If you set max_features, it keeps only the terms that occur most frequently across the whole corpus, measured by absolute counts; for example, with 'max_features = 3' it selects the 3 most common words in the data. With 'binary = True', CountVectorizer no longer takes the frequency of a term/word into account: every non-zero count is recorded as 1.
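A small sketch of these two options on the corpus from the question. Which words survive max_features=3 depends on how ties between equally frequent words are broken, so the sketch only prints the result instead of assuming it.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# Keep only the 3 most frequent terms across the whole corpus
top3 = CountVectorizer(max_features=3)
top3.fit(corpus)
print(sorted(top3.vocabulary_))

# Plain counts versus binary presence/absence for the word "second"
counts = CountVectorizer()
X_counts = counts.fit_transform(corpus).toarray()
binary = CountVectorizer(binary=True)
X_binary = binary.fit_transform(corpus).toarray()

col = counts.vocabulary_['second']
print(X_counts[1, col])   # 2: "second" occurs twice in document 1
print(X_binary[1, col])   # 1: with binary=True only presence is recorded
```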
The fit_transform() method is simply the combination of the fit method and the transform method: it is equivalent to fit(data).transform(data). It fits on the input data and returns the transformed data in a single call.
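The equivalence can be checked directly on the corpus from the question. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# One step: learn the vocabulary and build the count matrix together
X_one_step = CountVectorizer().fit_transform(corpus)

# Two steps: learn the vocabulary first, then build the count matrix
two_step = CountVectorizer()
two_step.fit(corpus)
X_two_step = two_step.transform(corpus)

# Compare the dense versions of both matrices element by element
same = bool((X_one_step.toarray() == X_two_step.toarray()).all())
print(same)              # True
print(X_one_step.shape)  # (4, 9): 4 documents, 9 distinct words
```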
You can interpret this as "(sentence_index, feature_index) count"
As there are 4 sentences, the sentence index starts at 0 and ends at 3
The feature index is the word index, which you can get from vectorizer.vocabulary_
-> vocabulary_ is a dictionary {word: feature_index, ...}
so for the example (0, 1) 1
-> 0 : row[the sentence index]
-> 1 : the feature index; the word is the key in vectorizer.vocabulary_ whose value is 1 (here "document")
-> 1 : count/tfidf (as you have used a count vectorizer, it will give you the count)
If you use a tfidf vectorizer instead of a count vectorizer, it will give you tfidf values. I hope I made it clear.
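To make the "(sentence_index, feature_index) count" reading concrete, here is a sketch that translates every entry of X back into a word. The names index_to_word and decoded are just illustrative helpers, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> feature index, so invert it to look up by index
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# COO format exposes the (row, column, value) triples of the sparse matrix
coo = X.tocoo()
decoded = [(int(r), index_to_word[int(c)], int(v))
           for r, c, v in zip(coo.row, coo.col, coo.data)]

for sentence_index, word, count in decoded:
    print(sentence_index, word, count)
```

For instance, the entry (1, 5) 2 is decoded to sentence 1, word "second", count 2.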
As @Himanshu writes, this is a "(sentence_index, feature_index) count"
Here, the count part is the "number of times a word appears in a document"
For example,
(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 2 <- here, the count "2" tells us that the word "second" appears twice in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
Let's change the corpus in your code. I added the word "second" two more times to the second sentence of the corpus list, so it now appears four times.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.','This is the second second second second document.','And the third one.','Is this the first document?']
X = vectorizer.fit_transform(corpus)
(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 4 <- for the modified corpus, the count "4" tells us that the word "second" appears four times in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
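If the sparse listing is hard to read, the same information can be viewed as a dense table with one row per sentence and one column per word. A sketch for the modified corpus (the variable name words is just illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second second second document.',
          'And the third one.',
          'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Column order follows the feature indices stored in vocabulary_
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(words)

# Dense view: zeros are the entries the sparse printout leaves out
dense = X.toarray()
print(dense)
print(dense[1].tolist())  # [0, 1, 0, 1, 0, 4, 1, 0, 1]
```

Only the non-zero cells of this table appear in the sparse printout, which is why each sentence lists just a handful of (row, column) pairs.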