Read sparse matrix in Python

I want to read a sparse matrix. When I build n-grams with scikit-learn's CountVectorizer, transform() returns its output as a sparse matrix. I want to read that matrix without calling todense().

Code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = ['john guy','nice guy']
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(document)
transformer = vectorizer.transform(document)
print(transformer)

Output :

  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (1, 0)    1
  (1, 3)    1
  (1, 4)    1

How can I read this output to get its values? I need the value at (0,0), (0,1), and so on, saved into a list.

iNikkz asked Nov 12 '14 14:11


3 Answers

The documentation for this transform method says it returns a sparse matrix, but doesn't specify the kind. Different kinds let you access the data in different ways, but it is easy to convert one to another. Your print display is the typical str for a sparse matrix.

An equivalent matrix can be generated with:

import numpy as np
from scipy import sparse
i=[0,0,0,1,1,1]
j=[0,1,2,0,3,4]
A=sparse.csr_matrix((np.ones_like(j),(i,j)))
print(A)

producing:

  (0, 0)        1
  (0, 1)        1
  (0, 2)        1
  (1, 0)        1
  (1, 3)        1
  (1, 4)        1

A csr type can be indexed like a dense matrix:

In [32]: A[0,0]
Out[32]: 1    
In [33]: A[0,3]
Out[33]: 0
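
As a minimal sketch of how that indexing could be used to fill a list, the snippet below rebuilds the same matrix, asks nonzero() for the positions of the stored entries, and looks each one up. The names positions and values are illustrative only. Indexing a csr matrix one element at a time is slow, so this is only sensible for small matrices.

```python
import numpy as np
from scipy import sparse

# Rebuild the example matrix from the answer.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

# nonzero() returns two arrays: row indices and column indices.
rows, cols = A.nonzero()

# Illustrative names: positions holds (row, col) pairs, values the entries.
positions = [(int(r), int(c)) for r, c in zip(rows, cols)]
values = [int(A[r, c]) for r, c in positions]

print(positions)
print(values)
```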

Internally the csr matrix stores its data in data, indices, indptr, which is convenient for calculation, but a bit obscure. Convert it to coo format to get data that looks just like your input:

In [34]: A.tocoo().row
Out[34]: array([0, 0, 0, 1, 1, 1], dtype=int32)

In [35]: A.tocoo().col
Out[35]: array([0, 1, 2, 0, 3, 4], dtype=int32)
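
The data, indices, and indptr arrays mentioned above can also be walked directly, without converting. This is a rough sketch, assuming the same example matrix: indptr[r] and indptr[r + 1] delimit the slice of indices (column numbers) and data (values) that belongs to row r. The name entries is illustrative.

```python
import numpy as np
from scipy import sparse

# Rebuild the example matrix from the answer.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

entries = []  # illustrative name: list of (row, col, value) tuples
for row in range(A.shape[0]):
    # Stored entries of this row live in positions start..end-1.
    start, end = A.indptr[row], A.indptr[row + 1]
    for col, val in zip(A.indices[start:end], A.data[start:end]):
        entries.append((row, int(col), int(val)))

print(entries)
```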

Or you can convert it to a dok type, and access that data like a dictionary:

A.todok().keys()
#  dict_keys([(0, 1), (0, 0), (1, 3), (1, 0), (0, 2), (1, 4)])
A.todok().items()

produces: (Python3 here)

dict_items([((0, 1), 1), 
            ((0, 0), 1), 
            ((1, 3), 1), 
            ((1, 0), 1), 
            ((0, 2), 1), 
            ((1, 4), 1)])
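
If a plain Python dictionary is more convenient, the dok items can be copied into one. This is only a sketch, assuming the same example matrix; the name lookup is illustrative, and the int() calls are there because the keys and values may be NumPy integers depending on the SciPy version.

```python
import numpy as np
from scipy import sparse

# Rebuild the example matrix from the answer.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

# Copy the stored entries into an ordinary dict keyed by (row, col).
lookup = {(int(r), int(c)): int(v) for (r, c), v in A.todok().items()}

print(lookup[(1, 3)])          # a stored entry
print(lookup.get((0, 3), 0))   # missing keys are implicit zeros
```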

The lil format stores the data as 2 lists of lists, one per matrix row: data holds the values (all 1s in this example), and rows holds the column indices of those values.
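
A short sketch of reading the lil attributes, assuming the same example matrix. In a lil matrix, rows[r] lists the column indices of the stored entries in row r and data[r] lists the matching values. The name per_row is illustrative.

```python
import numpy as np
from scipy import sparse

# Rebuild the example matrix from the answer.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

L = A.tolil()

# For each matrix row, pair every column index with its value.
per_row = [
    [(int(c), int(v)) for c, v in zip(cols, vals)]
    for cols, vals in zip(L.rows, L.data)
]

print(per_row)
```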

Or do you want to 'read' the data in some other way?

hpaulj answered Sep 20 '22 10:09


This is a SciPy CSR matrix. To convert this to (row, col, value) triples, the easiest option is to convert to COO format, then get the triples from that:

>>> from scipy.sparse import rand
>>> X = rand(100, 100, format='csr')
>>> X
<100x100 sparse matrix of type '<type 'numpy.float64'>'
    with 100 stored elements in Compressed Sparse Row format>
>>> C = X.tocoo()  # CSR has no .row/.col attributes; COO does
>>> list(zip(C.row, C.col, C.data))[:10]
[(1, 78, 0.73843533223380842),
 (1, 91, 0.30943772717074158),
 (2, 35, 0.52635078317400608),
 (4, 75, 0.34667509458006551),
 (5, 30, 0.86482318943934389),
 (7, 74, 0.46260571098933323),
 (8, 75, 0.74193890941716234),
 (9, 72, 0.50095749482583696),
 (9, 80, 0.85906284644174613),
 (11, 66, 0.83072142899400137)]

(Note that the output is sorted.)
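
Applied to the matrix from the question, the same idea might look like the sketch below. The helper name to_triples is hypothetical, not part of SciPy, and the matrix is rebuilt by hand here rather than taken from CountVectorizer, so the snippet only needs NumPy and SciPy.

```python
import numpy as np
from scipy import sparse


def to_triples(matrix):
    """Return the stored entries as a list of (row, col, value) tuples.

    Hypothetical helper: converts to COO first, so it should accept
    any scipy.sparse format.
    """
    coo = matrix.tocoo()
    return [(int(r), int(c), v.item())
            for r, c, v in zip(coo.row, coo.col, coo.data)]


# Same entries as the printed output in the question.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
X = sparse.csr_matrix((np.ones_like(j), (i, j)))

triples = to_triples(X)
print(triples)
```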

Fred Foo answered Sep 21 '22 10:09


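You can use toarray() and data as follows. Note that toarray() builds the full dense array, while data holds only the stored non-zero values: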

>>> indices=transformer.toarray()
>>> indices
array([[1, 1, 1, 0, 0],
       [1, 0, 0, 1, 1]])
>>> values=transformer.data
>>> values
array([1, 1, 1, 1, 1, 1])

Irshad Bhat answered Sep 20 '22 10:09