Understanding the matrix output of TfidfVectorizer in sklearn

I'm having trouble interpreting the matrix output for the Tfidf vectorizer.

Given

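from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                             min_df=2, stop_words='english',
                             use_idf=True)

# X_train_raw: an iterable of raw text documents (strings)
X_train_tfidf = vectorizer.fit_transform(X_train_raw)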

If I look at the output of X_train_tfidf, am I looking at a matrix structured like this:

Column 1 corresponds to document 1, and its elements are the tfidf scores of the 10000 features; column 2 corresponds to document 2; and so on?

RedShift asked Oct 26 '17 16:10


1 Answer

Assuming you're seeing output similar to this:

(0, 18)       0.424688479366
(0, 6)        0.424688479366
(0, 4)        0.424688479366
(0, 14)       0.239262081323
(0, 17)       0.202366335916
(0, 5)        0.424688479366
(0, 1)        0.424688479366
(1, 17)       0.184426607226
(1, 8)        0.387039944282
(1, 15)       0.387039944282
(1, 0)        0.387039944282
(1, 2)        0.387039944282
(1, 13)       0.387039944282
(1, 7)        0.387039944282
(1, 11)       0.259205161463
(2, 14)       0.313686744222
(2, 17)       0.530628478217
(2, 9)        0.556791722552
(2, 16)       0.556791722552
(3, 14)       0.346483013718
(3, 17)       0.293053113789
(3, 11)       0.411875926253
(3, 10)       0.61500486583
(3, 3)        0.496182053366
(4, 14)       0.346483013718
(4, 17)       0.293053113789
(4, 11)       0.411875926253
(4, 3)        0.496182053366
(4, 12)       0.61500486583

Each line has the general form (A, B) C, where:

A: document index (the row of the matrix)
B: index of a specific term in the word vector (the column of the matrix)
C: TF-IDF score for term B in document A
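
To make the (A, B) C reading concrete, here is a minimal pure-Python sketch that regroups a handful of entries copied from the printed output above by document index. The names entries, scores_by_doc and tfidf are made up for illustration and are not part of sklearn; the lookup returns 0.0 for any pair that was not printed, on the assumption that unlisted entries are zeros.

```python
# A few (A, B) C entries copied from the printed output above.
entries = [
    ((0, 14), 0.239262081323),
    ((0, 17), 0.202366335916),
    ((1, 17), 0.184426607226),
    ((1, 11), 0.259205161463),
    ((2, 14), 0.313686744222),
]

# Regroup by document index A, giving {A: {B: C}}.
scores_by_doc = {}
for (doc_idx, term_idx), score in entries:
    scores_by_doc.setdefault(doc_idx, {})[term_idx] = score


def tfidf(doc_idx, term_idx):
    """Return the score for a (document, term) pair; unlisted pairs are 0.0."""
    return scores_by_doc.get(doc_idx, {}).get(term_idx, 0.0)


print(tfidf(0, 14))  # a listed pair
print(tfidf(0, 0))   # not listed, so treated as zero
```

Read this way, term index 17 shows up in documents 0 and 1 of the sample, with a different score in each.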

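This is a sparse matrix with one row per document and one column per feature, so documents are rows, not columns. The printout lists only the non-zero tfidf scores in each document's word vector; any (document, term) pair that is not listed has a score of 0.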
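
If it helps, here is a small sketch for checking the layout yourself and mapping column indices back to words. The three-sentence corpus is a made-up stand-in for X_train_raw, and the vectorizer uses default settings rather than the ones in the question, so the numbers will not match the output above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, standing in for X_train_raw from the question.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()  # default settings, unlike the question's
X = vectorizer.fit_transform(docs)

# Rows are documents, columns are vocabulary terms.
n_docs, n_terms = X.shape
print(n_docs, n_terms)

# vocabulary_ maps term -> column index; invert it to look up terms by column.
index_to_term = {int(col): term for term, col in vectorizer.vocabulary_.items()}

# Walk the stored (row, column, value) triplets, i.e. the non-zero scores.
coo = X.tocoo()
for doc_idx, col_idx, score in zip(coo.row, coo.col, coo.data):
    print(f"doc {doc_idx}: {index_to_term[int(col_idx)]!r} -> {score:.3f}")

# Dense view for inspection only; on a large corpus this can use a lot of memory.
dense = X.toarray()
```

X.shape[0] should equal the number of documents you passed in, which is a quick way to confirm that documents are rows. Converting with toarray() is only sensible for small inputs; for real data it is usually better to keep the matrix sparse.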

BassFaceIV answered Oct 22 '22 03:10