I'm having trouble interpreting the matrix output for the Tfidf vectorizer.
Given
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                             min_df=2, stop_words='english',
                             use_idf=True)
X_train_tfidf = vectorizer.fit_transform(X_train_raw)
If I were to look at the output of X_train_tfidf, am I looking at a matrix that is structured like:
Column 1 corresponds to document 1 where its elements are tfidf scores of the 10000 features, Column 2 corresponds to document 2... and so on?
Assuming you're seeing output similar to this:
(0, 18) 0.424688479366
(0, 6) 0.424688479366
(0, 4) 0.424688479366
(0, 14) 0.239262081323
(0, 17) 0.202366335916
(0, 5) 0.424688479366
(0, 1) 0.424688479366
(1, 17) 0.184426607226
(1, 8) 0.387039944282
(1, 15) 0.387039944282
(1, 0) 0.387039944282
(1, 2) 0.387039944282
(1, 13) 0.387039944282
(1, 7) 0.387039944282
(1, 11) 0.259205161463
(2, 14) 0.313686744222
(2, 17) 0.530628478217
(2, 9) 0.556791722552
(2, 16) 0.556791722552
(3, 14) 0.346483013718
(3, 17) 0.293053113789
(3, 11) 0.411875926253
(3, 10) 0.61500486583
(3, 3) 0.496182053366
(4, 14) 0.346483013718
(4, 17) 0.293053113789
(4, 11) 0.411875926253
(4, 3) 0.496182053366
(4, 12) 0.61500486583
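The general form is (A, B) C, where:
A: document index (row)
B: index of the term in the vocabulary (column)
C: TF-IDF score of term B in document A
This is the printed form of a sparse matrix: only the non-zero entries are listed, and every (document, term) pair that does not appear has a score of 0. So the layout is the reverse of what you describe: each row corresponds to a document and each column to one of the (at most 10000) features.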
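To see which word each column index refers to, something like the following may help. It is only a minimal sketch: the three-document corpus `docs` is a made-up stand-in for X_train_raw, and the vectorizer uses default settings rather than the ones in the question. It relies on the fitted vectorizer's `vocabulary_` attribute, which maps each term to its column index.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, standing in for X_train_raw.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Rows are documents, columns are vocabulary terms.
n_docs, n_terms = X.shape
print(n_docs, "documents x", n_terms, "terms")

# vocabulary_ maps term -> column index; invert it to look terms up by column.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# Walk the non-zero entries in (row, column, value) form.
coo = X.tocoo()
entries = [
    (int(r), int(c), index_to_term[int(c)], float(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
]
for doc_idx, term_idx, term, score in entries:
    print(f"({doc_idx}, {term_idx}) {term!r} {score:.4f}")

# A single document as a dense vector (mostly zeros on a real corpus).
dense_row0 = X[0].toarray().ravel()
```

On a real corpus, avoid calling toarray() on the whole matrix; with 10000 features the dense form can be very large, which is why fit_transform returns a sparse matrix in the first place.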