How to make TF-IDF matrix dense?


I am using TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features, which I then plan to feed into a k-means algorithm (which I will implement myself). In that algorithm I will have to compute distances between centroids (categories of articles) and data points (articles). I am going to use Euclidean distance, so I need both to have the same dimension, in my case max_features. Here is what I have:

from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words and data are defined elsewhere
tfidf = TfidfVectorizer(max_features=10, strip_accents='unicode', analyzer='word', stop_words=stop_words.extra_stopwords, lowercase=True, use_idf=True)
X = tfidf.fit_transform(data['Content'])  # the matrix articles x max_features (= words)
for i, row in enumerate(X):
    print(X[i])

However, X seems to be a sparse(?) matrix, since the output is:

  (0, 9)    0.723131915847
  (0, 8)    0.090245047798
  (0, 6)    0.117465276892
  (0, 4)    0.379981697363
  (0, 3)    0.235921470645
  (0, 2)    0.0968780456528
  (0, 1)    0.495689001273

  (0, 9)    0.624910843051
  (0, 8)    0.545911131362
  (0, 7)    0.160545991411
  (0, 5)    0.49900042174
  (0, 4)    0.191549050212

  ...

I think (0, col) gives the column index in the matrix, which actually seems to behave like an array where every cell points to a list.

How do I convert this matrix to a dense one (so that every row has the same number of columns)?

>>> print(type(X))
<class 'scipy.sparse.csr.csr_matrix'>


asked Jan 31 '16 01:01

gsamaras

1 Answer

This should be as simple as:

dense = X.toarray()

TfidfVectorizer.fit_transform() returns a SciPy csr_matrix (Compressed Sparse Row matrix), which has a toarray() method for exactly this purpose. SciPy has several sparse matrix formats, but they all have a .toarray() method.
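To make the conversion concrete, here is a minimal sketch on a made-up three-document corpus (the `docs` list is hypothetical and only stands in for data['Content']). It also shows that the sparse matrix already has the same number of columns in every row; the sparse printout simply omits the zero entries.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for data['Content'].
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

tfidf = TfidfVectorizer(max_features=10)
X = tfidf.fit_transform(docs)   # SciPy sparse matrix, shape (n_docs, n_features)

dense = X.toarray()             # NumPy ndarray of the same shape, zeros included

print(type(dense).__name__)
print(X.shape, dense.shape)     # both shapes are identical
print(dense[0])                 # first document as a full-length row vector
```

Each row of `dense` has exactly max_features entries, so it can be compared directly against a centroid of the same length.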

Note that for a large matrix, the dense array will use far more memory than the sparse matrix, because it stores every zero explicitly. It is generally best to keep the data sparse for as long as possible.
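As a rough illustration of the memory point, the sketch below compares the two representations and computes Euclidean distances to a few centroids without densifying the data. The random matrix, its size and density, and the choice of the first three rows as centroids are all assumptions made for the example, not part of the question. It relies on scikit-learn's euclidean_distances, which accepts sparse input.

```python
from scipy import sparse
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical stand-in for a TF-IDF matrix:
# 1000 documents x 5000 features, about 1% of entries non-zero.
X = sparse.random(1000, 5000, density=0.01, format="csr", random_state=0)

# Memory held by the CSR representation (values + column indices + row pointers).
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
# Memory a dense array of the same shape and dtype would need.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print(sparse_bytes, dense_bytes)

# Centroids are usually dense; here, hypothetically, the first 3 documents.
centroids = X[:3].toarray()

# X stays sparse; only the small centroid array is dense.
dist = euclidean_distances(X, centroids)   # shape (n_docs, n_centroids)
labels = dist.argmin(axis=1)               # index of the nearest centroid per document
```

In a hand-written k-means, this assignment step is the only place where distances to every data point are needed, so only the centroids have to be dense.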


answered Sep 22 '22 08:09

Will

Tags:

python

cluster-analysis

scikit-learn

sparse-matrix

tf-idf
