I want to read a sparse matrix. When I build ngrams using scikit-learn, its transform() returns its output as a sparse matrix, and I want to read that matrix without calling todense().
Code:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = ['john guy','nice guy']
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(document)
transformer = vectorizer.transform(document)
print(transformer)
Output:
(0, 0) 1
(0, 1) 1
(0, 2) 1
(1, 0) 1
(1, 3) 1
(1, 4) 1
How can I read this output to get its values? I need the value at (0,0), (0,1), and so on, and to save them into a list.
The documentation for this transform method says it returns a sparse matrix, but doesn't specify the kind. Different kinds let you access the data in different ways, but it is easy to convert one to another. Your print display is the typical str for a sparse matrix.
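As a quick sketch of that point (building a small matrix directly rather than via the vectorizer), each conversion is a single method call:

```python
from scipy import sparse

# Build a small CSR matrix directly, standing in for vectorizer.transform output
A = sparse.csr_matrix([[1, 1, 1, 0, 0],
                       [1, 0, 0, 1, 1]])

B = A.tocoo()   # convert to COOrdinate format
C = A.todok()   # convert to Dictionary Of Keys format
print(A.format, B.format, C.format)  # csr coo dok
```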
An equivalent matrix can be generated with:
import numpy as np
from scipy import sparse

i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))
print(A)
producing:
(0, 0) 1
(0, 1) 1
(0, 2) 1
(1, 0) 1
(1, 3) 1
(1, 4) 1
A csr type can be indexed like a dense matrix:
In [32]: A[0,0]
Out[32]: 1
In [33]: A[0,3]
Out[33]: 0
Internally the csr matrix stores its data in the data, indices, and indptr attributes, which is convenient for calculation, but a bit obscure. Convert it to coo format to get data that looks just like your input:
In [34]: A.tocoo().row
Out[34]: array([0, 0, 0, 1, 1, 1], dtype=int32)
In [35]: A.tocoo().col
Out[35]: array([0, 1, 2, 0, 3, 4], dtype=int32)
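For completeness, the three CSR attributes mentioned above can be inspected directly. A small sketch rebuilding the same A:

```python
import numpy as np
from scipy import sparse

i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

print(A.data)     # non-zero values, stored row by row: [1 1 1 1 1 1]
print(A.indices)  # column index of each stored value:  [0 1 2 0 3 4]
print(A.indptr)   # row i's values sit in data[indptr[i]:indptr[i+1]]: [0 3 6]
```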
Or you can convert it to a dok type, and access that data like a dictionary:
A.todok().keys()
# dict_keys([(0, 1), (0, 0), (1, 3), (1, 0), (0, 2), (1, 4)])
A.todok().items()
produces (Python 3 here):
dict_items([((0, 1), 1),
((0, 0), 1),
((1, 3), 1),
((1, 0), 1),
((0, 2), 1),
((1, 4), 1)])
The lil format stores the data as 2 lists of lists, one with the data (all 1s in this example), and the other with the column indices for each row.
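A quick sketch of those two lists, rebuilding the same A as above:

```python
import numpy as np
from scipy import sparse

i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

L = A.tolil()
print(L.rows)  # per-row column indices: [list([0, 1, 2]) list([0, 3, 4])]
print(L.data)  # per-row values:         [list([1, 1, 1]) list([1, 1, 1])]
```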
Or do you want to 'read' the data in some other way?
This is a SciPy CSR matrix. To convert it to (row, col, value) triples, the easiest option is to convert to COO format, then get the triples from that:
>>> from scipy.sparse import rand
>>> X = rand(100, 100, format='csr')
>>> X
<100x100 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
>>> X = X.tocoo()
>>> list(zip(X.row, X.col, X.data))[:10]
[(1, 78, 0.73843533223380842),
(1, 91, 0.30943772717074158),
(2, 35, 0.52635078317400608),
(4, 75, 0.34667509458006551),
(5, 30, 0.86482318943934389),
(7, 74, 0.46260571098933323),
(8, 75, 0.74193890941716234),
(9, 72, 0.50095749482583696),
(9, 80, 0.85906284644174613),
(11, 66, 0.83072142899400137)]
(Note that the output is sorted.)
You can read the data and indices attributes directly. For comparison, toarray() densifies the matrix (which the question wants to avoid) and shows the full layout:
>>> transformer.toarray()
array([[1, 1, 1, 0, 0],
       [1, 0, 0, 1, 1]])
>>> transformer.data
array([1, 1, 1, 1, 1, 1])
>>> transformer.indices
array([0, 1, 2, 0, 3, 4], dtype=int32)