I am learning multi-label classification and trying to implement the tf-idf tutorial from scikit-learn. I am working with a text corpus and want to calculate its tf-idf scores, using the module sklearn.feature_extraction.text. Using CountVectorizer and TfidfTransformer, I now have my corpus vectorised, with a tf-idf value for each vocabulary term. The problem is that the result is a sparse matrix, like:
(0, 47) 0.104275891915
(0, 383) 0.084129133023
.
.
.
.
(4, 308) 0.0285015996586
(4, 199) 0.0285015996586
I want to convert this sparse.csr.csr_matrix into a list of lists, so that I can drop the document id from the above csr_matrix and get vocabularyId:tfidf pairs like
47:0.104275891915 383:0.084129133023
.
.
.
.
308:0.0285015996586
199:0.0285015996586
Is there any way to convert it into a list of lists, or any other way to change the format so that I get vocabularyId:tfidf pairs?
Sparse matrices can be used in arithmetic operations: they support addition, subtraction, multiplication, division, and matrix power. The CSR format in particular offers efficient arithmetic operations such as CSR + CSR and CSR * CSR.
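As a rough illustration of the arithmetic mentioned above, here is a minimal sketch on two small CSR matrices. The matrices A and B are made-up example data, and the sketch uses the @ operator for the matrix product to keep it clearly separate from the element-wise multiply() method.

```python
import numpy as np
from scipy import sparse

# Hypothetical 2x2 example data, chosen only for illustration.
A = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
B = sparse.csr_matrix(np.array([[0.0, 3.0], [4.0, 0.0]]))

total = A + B             # element-wise sum, result is sparse
diff = A - B              # element-wise difference, result is sparse
product = A @ B           # matrix product, result is sparse
elementwise = A.multiply(B)  # element-wise (Hadamard) product
halved = A / 2            # division by a scalar

print(total.toarray())
print(product.toarray())
```

The results stay sparse, so toarray() is only called here to make them easy to read.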
So we first convert the COO sparse matrix to a CSR (Compressed Sparse Row) matrix using the tocsr() function. We can then slice the rows of the sparse matrix with an array of row indices. Selecting three rows of a five-column matrix this way gives a 3×5 sparse matrix, still in CSR format.
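A minimal sketch of that conversion and row selection might look like the following. The 5×5 matrix and the row_idx array are assumptions made up for the example; they are not the data the paragraph above originally referred to.

```python
import numpy as np
from scipy import sparse

# Hypothetical 5x5 example data (row, column, value triplets).
rows = np.array([0, 1, 2, 3, 4, 4])
cols = np.array([0, 2, 1, 4, 0, 3])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

coo = sparse.coo_matrix((vals, (rows, cols)), shape=(5, 5))
csr = coo.tocsr()               # convert COO to CSR before indexing rows

row_idx = np.array([0, 2, 4])   # assumed selection of three rows
sub = csr[row_idx]              # row selection returns a sparse matrix

print(sub.shape)    # (3, 5)
print(sub.format)   # csr
```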
The compressed sparse row (CSR) or compressed row storage (CRS) or Yale format represents a matrix M by three (one-dimensional) arrays, that respectively contain nonzero values, the extents of rows, and column indices. It is similar to COO, but compresses the row indices, hence the name.
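To make the three arrays concrete, here is a small sketch using SciPy's csr_matrix, where they are exposed as the attributes data, indptr and indices. The 3×3 matrix is made-up example data.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 example matrix with one empty row.
dense = np.array([[0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.2, 0.0, 0.7]])
m = sparse.csr_matrix(dense)

print(m.data)     # nonzero values, row by row:   [0.5 0.2 0.7]
print(m.indices)  # column index of each value:   [1 0 2]
print(m.indptr)   # extents of rows:              [0 1 1 3]

# Row i owns the slice indptr[i]:indptr[i+1] of data and indices,
# so the empty row 1 has indptr[1] == indptr[2].
```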
I don't know what tf-idf expects, but I may be able to help with the sparse end.
Make a sparse matrix (after from scipy import sparse and import numpy as np):
In [526]: M=sparse.random(4,10,.1)
In [527]: M
Out[527]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in COOrdinate format>
In [528]: print(M)
(3, 1) 0.281301619779
(2, 6) 0.830780358032
(1, 1) 0.242503399296
(2, 2) 0.190933579917
Now convert it to coo format. It is already in that format (I could have given random a format parameter). In any case, the values in coo format are stored in 3 arrays:
In [529]: Mc=M.tocoo()
In [530]: Mc.data
Out[530]: array([ 0.28130162, 0.83078036, 0.2425034 , 0.19093358])
In [532]: Mc.row
Out[532]: array([3, 2, 1, 2], dtype=int32)
In [533]: Mc.col
Out[533]: array([1, 6, 1, 2], dtype=int32)
It looks like you want to ignore Mc.row and somehow join the others.
For example, as a dictionary:
In [534]: {k:v for k,v in zip(Mc.col, Mc.data)}
Out[534]: {1: 0.24250339929583264, 2: 0.19093357991697379, 6: 0.83078035803205375}
or as columns in a 2D array:
In [535]: np.column_stack((Mc.col, Mc.data))
Out[535]:
array([[ 1. , 0.28130162],
[ 6. , 0.83078036],
[ 1. , 0.2425034 ],
[ 2. , 0.19093358]])
(Also np.array((Mc.col, Mc.data)).T)
Or as just a list of arrays, [Mc.col, Mc.data], or as a list of lists, [Mc.col.tolist(), Mc.data.tolist()], etc.
Can you take it from there?
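One possible way to take it from there, if the pairs should stay grouped per document rather than merged across rows, is to walk the CSR arrays row by row. This is only a sketch: the helper names rows_as_pairs and format_row and the demo matrix are made up for illustration, and in the question the input would be the tf-idf csr_matrix.

```python
import numpy as np
from scipy import sparse

def rows_as_pairs(matrix):
    """Return one list of (column, value) pairs per row of a sparse matrix."""
    csr = sparse.csr_matrix(matrix)  # accepts any scipy sparse format
    result = []
    for i in range(csr.shape[0]):
        start, end = csr.indptr[i], csr.indptr[i + 1]
        cols = csr.indices[start:end].tolist()
        vals = csr.data[start:end].tolist()
        result.append(list(zip(cols, vals)))
    return result

def format_row(pairs):
    """Render pairs as 'col:value col:value ...' (hypothetical output format)."""
    return " ".join(f"{col}:{val}" for col, val in pairs)

# Hypothetical demo data standing in for a tf-idf matrix.
demo = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0, 0.25],
                                   [0.0, 0.0, 0.0, 0.0],
                                   [0.125, 0.0, 0.0, 0.0]]))
pairs = rows_as_pairs(demo)
lines = [format_row(p) for p in pairs]
print(lines)
```

Unlike the dictionary built from Mc.col and Mc.data, this keeps a column index that occurs in several rows from being overwritten, because each row gets its own list.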
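Based on the SciPy API, I suggest this method. Note that it produces a dense list of lists, with the zero entries included, rather than index:value pairs: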
ndarray = yourMatrix.toarray()
listOflist = ndarray.tolist()
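A quick sketch of what that conversion returns, using a made-up 2×2 matrix in place of yourMatrix. Every cell is materialised, zeros included, so for a large vocabulary the dense copy can be far bigger than the sparse matrix.

```python
import numpy as np
from scipy import sparse

# Hypothetical example matrix standing in for yourMatrix.
m = sparse.csr_matrix(np.array([[0.0, 0.5],
                                [0.25, 0.0]]))

nested = m.toarray().tolist()   # dense nested list, zeros included
print(nested)                   # [[0.0, 0.5], [0.25, 0.0]]
```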