I am using Python's scikit-learn for document clustering, and I have a sparse matrix stored in a dict object.
For example:
doc_term_dict = { ('d1','t1'): 12,
                  ('d2','t3'): 10,
                  ('d3','t2'): 5
                } # from MySQL data table
<type 'dict'>
I want to use scikit-learn to do the clustering, where the input matrix type is scipy.sparse.csr.csr_matrix
Example:
(0, 2164) 0.245793088885
(0, 2076) 0.205702177467
(0, 2037) 0.193810934784
(0, 2005) 0.14547028437
(0, 1953) 0.153720023365
...
<class 'scipy.sparse.csr.csr_matrix'>
I can't find a way to convert the dict to this CSR matrix (I have never used scipy).
Pretty straightforward. First read the dictionary and convert the keys to the appropriate row and column. Scipy supports (and recommends for this purpose) the COO-rdinate format for sparse matrices.
Pass it data, row, and column, where A[row[k], column[k]] = data[k] (for all k) defines the matrix. Then let Scipy do the conversion to CSR.
Please check that I have the rows and columns the way you want them; I might have them transposed. I also assumed that the input would be 1-indexed.
My code below prints:
(0, 0) 12
(1, 2) 10
(2, 1) 5
Code:
#!/usr/bin/env python3
# http://stackoverflow.com/questions/26335059/converting-python-sparse-matrix-dict-to-scipy-sparse-matrix

from scipy.sparse import csr_matrix, coo_matrix

def convert(term_dict):
    ''' Convert a dictionary with elements of form ('d1', 't1'): 12 to a CSR type matrix.
    The element ('d1', 't1'): 12 becomes entry (0, 0) = 12.
    * Conversion from 1-indexed to 0-indexed.
    * d is row
    * t is column.
    '''
    # Create the appropriate format for the COO format.
    data = []
    row = []
    col = []
    for k, v in term_dict.items():
        r = int(k[0][1:])
        c = int(k[1][1:])
        data.append(v)
        row.append(r-1)
        col.append(c-1)
    # Create the COO-matrix
    coo = coo_matrix((data,(row,col)))
    # Let Scipy convert COO to CSR format and return
    return csr_matrix(coo)

if __name__=='__main__':
    doc_term_dict = { ('d1','t1'): 12,
                      ('d2','t3'): 10,
                      ('d3','t2'): 5
                    }
    print(convert(doc_term_dict))
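The parsing above depends on keys shaped like 'd<number>' and 't<number>'. If the document and term identifiers coming out of the database are arbitrary strings, one option is to build a label-to-index mapping first. The following is only a sketch under that assumption: the helper name `dict_to_csr_with_labels` and the sample labels are made up for illustration, and sorting the labels is just one possible way to fix the row and column order.

```python
from scipy.sparse import coo_matrix


def dict_to_csr_with_labels(term_dict):
    """Hypothetical helper: map arbitrary (doc, term) labels to indices.

    Returns the CSR matrix plus the sorted row and column labels, so that
    entry (i, j) corresponds to (docs[i], terms[j]).
    """
    docs = sorted({d for d, _ in term_dict})
    terms = sorted({t for _, t in term_dict})
    doc_index = {d: i for i, d in enumerate(docs)}
    term_index = {t: j for j, t in enumerate(terms)}

    # Build the COO triplets: A[row[k], col[k]] = data[k].
    data, row, col = [], [], []
    for (d, t), value in term_dict.items():
        data.append(value)
        row.append(doc_index[d])
        col.append(term_index[t])

    # Pass the shape explicitly rather than letting SciPy infer it.
    shape = (len(docs), len(terms))
    return coo_matrix((data, (row, col)), shape=shape).tocsr(), docs, terms


# Made-up sample data, not taken from the question.
matrix, docs, terms = dict_to_csr_with_labels(
    {("doc-a", "apple"): 12, ("doc-b", "cherry"): 10, ("doc-c", "banana"): 5}
)
print(matrix.toarray())
```

Returning the label lists alongside the matrix makes it possible to map cluster assignments (one per row) back to the original document identifiers.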
We can make @Unapiedra's (excellent) answer a little more sparse:
import numpy as np
from scipy.sparse import csr_matrix

def _dict_to_csr(term_dict):
    # Assumes the keys are already 0-based integer (row, col) pairs.
    term_dict_k = list(term_dict.keys())
    term_dict_v = list(term_dict.values())
    rows, cols = zip(*term_dict_k)
    # Square shape: largest index + 1 in both dimensions.
    size = int(np.asarray(term_dict_k).max()) + 1
    return csr_matrix((term_dict_v, (rows, cols)), shape=(size, size))
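Note that the function above sizes both dimensions from the single largest index, which yields a square matrix. A document-term matrix usually has different numbers of documents and terms, so it may be preferable to size each dimension separately. A minimal sketch, assuming the keys are already 0-based integer (row, column) pairs; the name `int_dict_to_csr` and the sample values are hypothetical:

```python
from scipy.sparse import csr_matrix


def int_dict_to_csr(term_dict):
    """Hypothetical variant for keys that are 0-based (row, col) integers."""
    rows, cols = zip(*term_dict.keys())
    values = list(term_dict.values())
    # Size each dimension separately instead of forcing a square matrix.
    shape = (max(rows) + 1, max(cols) + 1)
    return csr_matrix((values, (rows, cols)), shape=shape)


# Made-up sample data: four documents, three terms.
example = int_dict_to_csr({(0, 0): 12, (1, 2): 10, (2, 1): 5, (3, 0): 7})
print(example.shape)
```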