converting scipy.sparse.csr.csr_matrix to a list of lists

I am learning multi-label classification and trying to implement the tf-idf tutorial from scikit-learn. I have a text corpus and want to calculate its tf-idf scores, using the module sklearn.feature_extraction.text. With CountVectorizer and TfidfTransformer I now have my corpus vectorised, with a tf-idf score for each vocabulary term. The problem is that the result is a sparse matrix, like:

(0, 47)    0.104275891915
(0, 383)    0.084129133023
.
.
.
.
(4, 308)    0.0285015996586
(4, 199)    0.0285015996586

I want to convert this sparse.csr.csr_matrix into a list of lists so that I can drop the document id from the above csr_matrix and get vocabularyId:tfidf pairs like

47:0.104275891915 383:0.084129133023
.
.
.
.
308:0.0285015996586 
199:0.0285015996586

Is there any way to convert it into a list of lists, or any other way to change the format to get vocabularyId:tfidf pairs?

565

asked Nov 19 '16 16:11

Saurabh

2 Answers

I don't know what tf-idf expects, but I may be able to help with the sparse end.

Make a sparse matrix (with from scipy import sparse and import numpy as np already run):

In [526]: M=sparse.random(4,10,.1)
In [527]: M
Out[527]: 
<4x10 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in COOrdinate format>
In [528]: print(M)
  (3, 1)    0.281301619779
  (2, 6)    0.830780358032
  (1, 1)    0.242503399296
  (2, 2)    0.190933579917

Now convert it to coo format. Here it is already in that format (I could have passed sparse.random a format parameter). In any case, the values in coo format are stored in 3 arrays:

In [529]: Mc=M.tocoo()
In [530]: Mc.data
Out[530]: array([ 0.28130162,  0.83078036,  0.2425034 ,  0.19093358])
In [532]: Mc.row
Out[532]: array([3, 2, 1, 2], dtype=int32)
In [533]: Mc.col
Out[533]: array([1, 6, 1, 2], dtype=int32)
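The same three arrays can be read from a matrix that starts out in CSR format, which is what the vectorizer in the question produces. This is a minimal sketch on a small hand-built matrix (the values are made up for illustration, not taken from the session above), so the result is reproducible:

```python
import numpy as np
from scipy import sparse

# Small hand-built matrix; the values are illustrative only.
dense = np.array([[0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.25, 0.0, 0.75]])
M = sparse.csr_matrix(dense)

# tocoo() exposes the stored elements as three parallel arrays.
Mc = M.tocoo()
rows = Mc.row.tolist()   # row index of each stored element
cols = Mc.col.tolist()   # column index of each stored element
vals = Mc.data.tolist()  # the stored values themselves

print(rows, cols, vals)
```

Element i of each list describes the same stored entry, so zipping them gives (row, col, value) triples.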

Looks like you want to ignore Mc.row, and somehow join the others.

For example as a dictionary:

In [534]: {k:v for k,v in zip(Mc.col, Mc.data)}
Out[534]: {1: 0.24250339929583264, 2: 0.19093357991697379, 6: 0.83078035803205375}
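Note that the dictionary above has three entries although the matrix stores four values: column 1 occurs in two different rows, and the later value overwrites the earlier one. If that matters, one option is to keep a separate dictionary per row. A rough sketch, using plain lists that copy the (rounded) arrays shown above; the name per_row is just for illustration:

```python
from collections import defaultdict

# Values copied from the Mc.row, Mc.col and Mc.data outputs above.
row = [3, 2, 1, 2]
col = [1, 6, 1, 2]
data = [0.28130162, 0.83078036, 0.2425034, 0.19093358]

# Flat dictionary: column 1 appears twice, so only the last value survives.
flat = dict(zip(col, data))

# One dictionary per row keeps every stored value.
per_row = defaultdict(dict)
for r, c, v in zip(row, col, data):
    per_row[r][c] = v

print(flat)
print(dict(per_row))
```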

or as columns in a 2D array:

In [535]: np.column_stack((Mc.col, Mc.data))
Out[535]: 
array([[ 1.        ,  0.28130162],
       [ 6.        ,  0.83078036],
       [ 1.        ,  0.2425034 ],
       [ 2.        ,  0.19093358]])

(Also np.array((Mc.col, Mc.data)).T)

Or as just a list of arrays, [Mc.col, Mc.data], or as a list of lists, [Mc.col.tolist(), Mc.data.tolist()], etc.

Can you take it from there?
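To get one list of column:value pairs per document, as the question asks, one possible approach is to walk the CSR arrays row by row. This is only a sketch: rows_as_pairs is a hypothetical helper name, and the matrix is a small made-up example rather than real tf-idf output.

```python
import numpy as np
from scipy import sparse

def rows_as_pairs(matrix):
    """Return one list of (column, value) pairs per row of a sparse matrix."""
    csr = sparse.csr_matrix(matrix)
    result = []
    for i in range(csr.shape[0]):
        # indptr[i]:indptr[i + 1] is the slice of stored elements for row i.
        start, end = csr.indptr[i], csr.indptr[i + 1]
        cols = csr.indices[start:end].tolist()
        vals = csr.data[start:end].tolist()
        result.append(list(zip(cols, vals)))
    return result

# Illustrative values only.
dense = np.array([[0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.25, 0.0, 0.75]])
pairs = rows_as_pairs(sparse.csr_matrix(dense))

# Format each row as "col:value col:value ...".
lines = [" ".join(f"{c}:{v}" for c, v in row) for row in pairs]
print(lines)
```

A row with no stored elements comes out as an empty list (and an empty string), so the row position still identifies the document.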

151

answered Sep 19 '22 11:09

hpaulj

Based on SciPy, I suggest this method:

ndarray = yourMatrix.toarray()
listOflist = ndarray.tolist()
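Keep in mind that toarray() builds the full dense array, so the resulting lists include every zero. If only the stored entries are wanted, SciPy's LIL ("list of lists") format may be closer to what the question describes. A small sketch comparing the two on made-up data:

```python
import numpy as np
from scipy import sparse

# Illustrative values only.
dense = np.array([[0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.25, 0.0, 0.75]])
M = sparse.csr_matrix(dense)

# Dense route: one inner list per row, zeros included.
dense_lists = M.toarray().tolist()

# LIL route: per-row lists of column indices and of the matching values.
lil = M.tolil()
col_lists = [list(r) for r in lil.rows]
val_lists = [list(d) for d in lil.data]

print(dense_lists)
print(col_lists, val_lists)
```

For a large vocabulary the dense route can use a lot of memory, since every document gets one entry per vocabulary term.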

answered Sep 22 '22 11:09

P t


Tags:

python

machine-learning

scipy

scikit-learn

tf-idf
