
converting scipy.sparse.csr.csr_matrix to a list of lists

I am learning multi-label classification and trying to implement the tf-idf tutorial from scikit-learn. I am working with a text corpus and want to calculate its tf-idf scores, using the module sklearn.feature_extraction.text. With CountVectorizer and TfidfTransformer I now have my corpus vectorised, with a tf-idf value for each vocabulary term. The problem is that the result is a sparse matrix, like:

(0, 47) 0.104275891915
(0, 383)    0.084129133023
.
.
.
.
(4, 308)    0.0285015996586
(4, 199)    0.0285015996586

I want to convert this sparse.csr.csr_matrix into a list of lists so that I can drop the document id from the above csr_matrix and keep only the vocabularyId and tf-idf pairs, like

47:0.104275891915 383:0.084129133023
.
.
.
.
308:0.0285015996586 
199:0.0285015996586

Is there any way to convert it into a list of lists, or any other way to change the format so that I get vocabularyId:tfidf pairs?

Saurabh asked Nov 19 '16 16:11




2 Answers

I don't know what tf-idf expects, but I may be able to help with the sparse end.

Make a sparse matrix:

In [526]: M=sparse.random(4,10,.1)
In [527]: M
Out[527]: 
<4x10 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in COOrdinate format>
In [528]: print(M)
  (3, 1)    0.281301619779
  (2, 6)    0.830780358032
  (1, 1)    0.242503399296
  (2, 2)    0.190933579917

Now convert it to COO format. This matrix is already in that format (I could have given sparse.random a format parameter), so the conversion changes nothing here. In any case, the values in COO format are stored in 3 arrays:

In [529]: Mc=M.tocoo()
In [530]: Mc.data
Out[530]: array([ 0.28130162,  0.83078036,  0.2425034 ,  0.19093358])
In [532]: Mc.row
Out[532]: array([3, 2, 1, 2], dtype=int32)
In [533]: Mc.col
Out[533]: array([1, 6, 1, 2], dtype=int32)

Looks like you want to ignore Mc.row, and somehow join the others.

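For example as a dictionary (note that a column index that appears in more than one row, like column 1 here, keeps only its last value):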

In [534]: {k:v for k,v in zip(Mc.col, Mc.data)}
Out[534]: {1: 0.24250339929583264, 2: 0.19093357991697379, 6: 0.83078035803205375}

or as columns in a 2d array:

In [535]: np.column_stack((Mc.col, Mc.data))
Out[535]: 
array([[ 1.        ,  0.28130162],
       [ 6.        ,  0.83078036],
       [ 1.        ,  0.2425034 ],
       [ 2.        ,  0.19093358]])

(Also np.array((Mc.col, Mc.data)).T)

Or as just a list of arrays, [Mc.col, Mc.data], or as a list of lists, [Mc.col.tolist(), Mc.data.tolist()], etc.

Can you take it from there?
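If the pairs need to stay grouped per document, one possible way to go further is to walk the CSR arrays directly. This is only a sketch: the small matrix X and the helper rows_as_pairs are made up for illustration and stand in for the real tf-idf matrix. It relies on the indptr, indices and data attributes of a csr_matrix.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the tf-idf matrix (values are made up for illustration).
dense = np.array([[0.0, 0.5, 0.0, 0.25],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.75, 0.0, 0.125, 0.0]])
X = sparse.csr_matrix(dense)
X.sort_indices()  # make sure column indices are ascending within each row


def rows_as_pairs(csr):
    """Return one list of (column, value) pairs per row of a CSR matrix."""
    result = []
    # indptr[i]:indptr[i + 1] is the slice of indices/data belonging to row i.
    for start, end in zip(csr.indptr[:-1], csr.indptr[1:]):
        cols = csr.indices[start:end].tolist()
        vals = csr.data[start:end].tolist()
        result.append(list(zip(cols, vals)))
    return result


pairs = rows_as_pairs(X)

# Format each row as "col:value col:value ...", dropping the row (document) id.
lines = [" ".join(f"{c}:{v}" for c, v in row) for row in pairs]
for line in lines:
    print(line)
```

A row with no stored values simply becomes an empty list (and an empty output line), so the list index still corresponds to the document position.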

hpaulj answered Sep 19 '22 11:09


Based on the SciPy documentation, I suggest this method:

ndarray = yourMatrix.toarray()
listOflist = ndarray.tolist()
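
Note that toarray() builds the full dense matrix, so every row in the resulting list also contains the zeros. As a sketch of an alternative that keeps only the stored entries, the LIL format exposes per-row lists through its rows and data attributes. The matrix X below is a made-up stand-in, not the asker's data.

```python
import numpy as np
from scipy import sparse

# Made-up 2x3 example matrix standing in for the real tf-idf matrix.
X = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                [0.25, 0.0, 0.75]]))

# Dense route: one full-length list per row, zeros included.
dense_rows = X.toarray().tolist()

# LIL route: per-row lists holding only the stored entries.
L = X.tolil()
col_lists = L.rows.tolist()   # column indices of each row
val_lists = L.data.tolist()   # matching values of each row

print(dense_rows)
print(col_lists)
print(val_lists)
```

For a large vocabulary the dense route may need far more memory than the sparse matrix itself, which is the main reason to prefer the per-row lists.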

P t answered Sep 22 '22 11:09