How to make TF-IDF matrix dense?

I am using TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features, which I then plan to feed into a k-means algorithm (which I will implement myself). In that algorithm I will have to compute distances between centroids (categories of articles) and data points (articles). I am going to use Euclidean distance, so I need these two entities to have the same dimension, in my case max_features. Here is what I have:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=10, strip_accents='unicode', analyzer='word',
                        stop_words=stop_words.extra_stopwords, lowercase=True, use_idf=True)
X = tfidf.fit_transform(data['Content'])  # the matrix articles x max_features(=words)
for i, row in enumerate(X):
    print(X[i])

However X seems to be a sparse(?) matrix, since the output is:

  (0, 9)    0.723131915847
  (0, 8)    0.090245047798
  (0, 6)    0.117465276892
  (0, 4)    0.379981697363
  (0, 3)    0.235921470645
  (0, 2)    0.0968780456528
  (0, 1)    0.495689001273

  (0, 9)    0.624910843051
  (0, 8)    0.545911131362
  (0, 7)    0.160545991411
  (0, 5)    0.49900042174
  (0, 4)    0.191549050212

  ...

Where I think (0, col) states the column index in the matrix, which behaves like an array where every cell points to a list.

How do I convert this matrix to a dense one (so that every row has the same number of columns)?


>>> print(type(X))
<class 'scipy.sparse.csr.csr_matrix'>
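To illustrate what that printout means, here is a toy sketch (made-up numbers, not the question's data) of how a csr_matrix prints only its nonzero cells as "(row, col)  value" lines, and how .toarray() expands it to a dense array where every row has the same number of columns:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny sparse matrix with 3 nonzero entries out of 6 cells.
sparse = csr_matrix(np.array([[0.0, 0.5, 0.0],
                              [0.8, 0.0, 0.3]]))
print(sparse)          # only the 3 nonzero entries are printed, as (row, col)  value
dense = sparse.toarray()
print(dense.shape)     # (2, 3): every row now has the same number of columns
```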
gsamaras asked Jan 31 '16

People also ask

Is TF-IDF sparse or dense?

TfidfVectorizer usually creates sparse data. If the data is sparse enough, the matrix usually stays sparse all along the pipeline until the predictor is trained.

How do you make a sparse matrix dense?

You can use either the todense() or the toarray() method to convert a CSR matrix to a dense matrix.

How do I interpret my TF-IDF score?

Putting it together: by multiplying these two values we get the final TF-IDF value. The higher the TF-IDF score, the more important or relevant the term is; as a term becomes less relevant, its TF-IDF score approaches 0.
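As a toy worked example of the classic formula (all numbers invented; note that scikit-learn uses a smoothed IDF variant, so its values differ slightly):

```python
import math

# Hypothetical numbers: a term appears 3 times in a 100-word document,
# and occurs in 10 of the 1,000,000 documents in the corpus.
tf = 3 / 100                      # term frequency
idf = math.log(1_000_000 / 10)    # inverse document frequency
tfidf = tf * idf
print(round(tfidf, 4))            # → 0.3454
```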

What is the difference between CountVectorizer and TfidfVectorizer?

The main difference between the two implementations is that TfidfVectorizer computes both the term frequencies and the inverse document frequencies for you, while with TfidfTransformer you must first use Scikit-Learn's CountVectorizer class to compute the term frequencies.
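A quick sanity check of that equivalence (toy documents invented here; the default parameters of the two routes match):

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ["the cat sat", "the dog sat on the mat"]

# Two-step route: raw term counts first, then the IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # → True
```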


1 Answer

This should be as simple as:

dense = X.toarray()

TfidfVectorizer.fit_transform() returns a SciPy csr_matrix (Compressed Sparse Row matrix), which has a toarray() method for exactly this purpose. SciPy offers several sparse-matrix formats, but they all have a .toarray() method.

Note that for a large matrix this will use a tremendous amount of memory compared to the sparse representation, so it is generally best to keep the matrix sparse for as long as possible.
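For the k-means use case in the question, here is a minimal sketch (toy matrix and a made-up centroid, not the asker's data) of densifying and then computing Euclidean distances. Note that sklearn's euclidean_distances also accepts sparse input directly, so densifying can often be avoided entirely:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for the TF-IDF matrix: 2 documents, 2 features.
X = csr_matrix(np.array([[0.0, 1.0],
                         [1.0, 0.0]]))

dense = X.toarray()                    # shape (n_documents, max_features)
centroid = np.zeros(dense.shape[1])    # hypothetical centroid at the origin
dists = np.linalg.norm(dense - centroid, axis=1)
print(dists)                           # each document is distance 1.0 from the origin
```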

Will answered Sep 22 '22