I want to read a sparse matrix. When I build ngrams using scikit-learn, its transform() returns its output as a sparse matrix, and I want to read that matrix without calling todense().
Code:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = ['john guy','nice guy']
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(document)
transformer = vectorizer.transform(document)
print(transformer)
Output:
(0, 0) 1
(0, 1) 1
(0, 2) 1
(1, 0) 1
(1, 3) 1
(1, 4) 1
How can I read this output to get its values? I need the value at (0,0), (0,1), and so on, and to save them into a list.
The documentation for this transform method says it returns a sparse matrix, but doesn't specify the kind. Different kinds let you access the data in different ways, but it is easy to convert one to another. Your print display is the typical str for a sparse matrix.
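As a quick sketch of that point (building a small matrix directly rather than via the vectorizer), each conversion is a single method call:

```python
from scipy import sparse

# Build a small CSR matrix directly, standing in for vectorizer.transform output
A = sparse.csr_matrix([[1, 1, 1, 0, 0],
                       [1, 0, 0, 1, 1]])

B = A.tocoo()   # convert to COOrdinate format
C = A.todok()   # convert to Dictionary Of Keys format
print(A.format, B.format, C.format)  # csr coo dok
```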
An equivalent matrix can be generated with:
import numpy as np
from scipy import sparse

i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))
print(A)
producing:
(0, 0) 1
(0, 1) 1
(0, 2) 1
(1, 0) 1
(1, 3) 1
(1, 4) 1
A csr type can be indexed like a dense matrix:
In [32]: A[0,0]
Out[32]: 1
In [33]: A[0,3]
Out[33]: 0
Internally the csr matrix stores its data in the data, indices, and indptr attributes, which is convenient for calculation, but a bit obscure. Convert it to coo format to get data that looks just like your input:
In [34]: A.tocoo().row
Out[34]: array([0, 0, 0, 1, 1, 1], dtype=int32)
In [35]: A.tocoo().col
Out[35]: array([0, 1, 2, 0, 3, 4], dtype=int32)
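For completeness, the three CSR attributes mentioned above can be inspected directly. A small sketch rebuilding the same A:

```python
import numpy as np
from scipy import sparse

i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

print(A.data)     # non-zero values, stored row by row: [1 1 1 1 1 1]
print(A.indices)  # column index of each stored value:  [0 1 2 0 3 4]
print(A.indptr)   # row i's values sit in data[indptr[i]:indptr[i+1]]: [0 3 6]
```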
Or you can convert it to a dok type, and access that data like a dictionary:
A.todok().keys()
# dict_keys([(0, 1), (0, 0), (1, 3), (1, 0), (0, 2), (1, 4)])
A.todok().items()
produces (Python 3 here):
dict_items([((0, 1), 1),
((0, 0), 1),
((1, 3), 1),
((1, 0), 1),
((0, 2), 1),
((1, 4), 1)])
The lil format stores the data as 2 lists of lists, one with the data (all 1s in this example), and the other with the column indices for each row.
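A quick sketch of those two lists, rebuilding the same A as above:

```python
import numpy as np
from scipy import sparse

i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

L = A.tolil()
print(L.rows)  # per-row column indices: [list([0, 1, 2]) list([0, 3, 4])]
print(L.data)  # per-row values:         [list([1, 1, 1]) list([1, 1, 1])]
```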
Or do you want to 'read' the data in some other way?
This is a SciPy CSR matrix. To convert it to (row, col, value) triples, the easiest option is to convert to COO format, then get the triples from that:
>>> from scipy.sparse import rand
>>> X = rand(100, 100, format='csr')
>>> X
<100x100 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
>>> X = X.tocoo()
>>> list(zip(X.row, X.col, X.data))[:10]
[(1, 78, 0.73843533223380842),
(1, 91, 0.30943772717074158),
(2, 35, 0.52635078317400608),
(4, 75, 0.34667509458006551),
(5, 30, 0.86482318943934389),
(7, 74, 0.46260571098933323),
(8, 75, 0.74193890941716234),
(9, 72, 0.50095749482583696),
(9, 80, 0.85906284644174613),
(11, 66, 0.83072142899400137)]
(Note that the output is sorted.)
You can read the data and indices attributes directly. For comparison, toarray() densifies the matrix (which the question wants to avoid) and shows the full layout:
>>> transformer.toarray()
array([[1, 1, 1, 0, 0],
       [1, 0, 0, 1, 1]])
>>> transformer.data
array([1, 1, 1, 1, 1, 1])
>>> transformer.indices
array([0, 1, 2, 0, 3, 4], dtype=int32)