How does vectorizer fit_transform work in sklearn?

Tags: python, machine-learning, scikit-learn

I'm trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This is the second second document.', 'And the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)
print(X)

When I print X to see what is returned, I get this result:

(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  2
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1

However, I don't understand what this result means.

asked Dec 20 '17 03:12

Leo

2 Answers

You can read each line as "(sentence_index, feature_index) count".

The sentence index is the row. As there are 4 sentences, it starts from 0 and ends at 3.

The feature index is the word index, which you can get from vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature_index, ...}

So for the example (0, 1) 1

-> 0 : row (the sentence index)

-> 1 : feature index, i.e. the word whose value in vectorizer.vocabulary_ is 1 (here "document")

-> 1 : count/tfidf (as you have used a count vectorizer, it gives you the count)
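As a rough sketch of the reading above, the snippet below inverts vocabulary_ so that each feature index can be looked up as a word, then lists every non-zero entry as (sentence_index, word, count). The names index_to_word and decoded are only illustrative, and default CountVectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> feature index; invert it to look words up by index
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Walk the non-zero entries of the sparse matrix as (row, column, value) triples
coo = X.tocoo()
decoded = [
    (int(r), index_to_word[int(c)], int(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
]

for row, word, count in decoded:
    print(row, word, count)
```

With this corpus the vocabulary is sorted alphabetically, so index 1 is "document" and index 5 is "second".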

If you use a tfidf vectorizer (TfidfVectorizer) instead of a count vectorizer, it will give you tfidf values. I hope that makes it clear.
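For comparison, here is a minimal sketch of the same corpus passed through TfidfVectorizer with its default settings (an assumption, the question does not use it). The matrix has the same shape and the same non-zero positions, but the stored values are tf-idf weights rather than integer counts. The names X_tfidf, dense and second_weight are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# Same layout as the count matrix: one row per sentence, one column per word
dense = X_tfidf.toarray()

# Weight of "second" in the second sentence: a float, not the raw count 2
second_weight = float(dense[1, tfidf.vocabulary_['second']])

print(X_tfidf.shape)
print(second_weight)
```

By default each row is L2-normalized, so the weights in a row are floats between 0 and 1.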

answered Sep 27 '22 23:09

Himanshu Kriplani

As @Himanshu writes, each line is a "(sentence_index, feature_index) count".

Here, the count is the number of times a word appears in a document.

For example,

(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 2   <-- in this example, the count "2" tells us that the word "second" appears twice in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1

Let's change the corpus in your code. I added the word "second" two more times to the second sentence of the corpus list, so it now appears four times.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This is the second second second second document.', 'And the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)
print(X)

(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 4   <-- for the modified corpus, the count "4" tells us that the word "second" appears four times in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
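One way to double-check the modified result is to convert the sparse matrix to a dense array and read the row for the second sentence directly. This is only a sketch for a small corpus (toarray() can be expensive on large ones), and the names dense and second_col are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Dense view: rows are sentences, columns are words in vocabulary order
dense = X.toarray()

# Column that holds the counts for the word "second"
second_col = vectorizer.vocabulary_['second']

print(dense[1])              # counts for the second sentence
print(dense[1, second_col])  # count of "second" in that sentence
```

The zeros in the dense row are exactly the entries that the sparse printout leaves out.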

answered Sep 27 '22 23:09

Anjani Anjani
