I am using SciPy to construct a large, sparse (250k x 250k) co-occurrence matrix with <code>scipy.sparse.lil_matrix</code>. Co-occurrence matrices are symmetric; that is, M[i,j] == M[j,i]. Since it would be highly inefficient (and in my case, impossible) to store all the data twice, I currently store each value only at the coordinate (i,j) where i is no larger than j. In other words, I have a value stored at (2,3) and no value stored at (3,2), even though (3,2) in my model should be equal to (2,3). (See the matrix below for an example.)

My problem is that I need to be able to extract the data corresponding to an arbitrary index but, at least the way I'm currently doing it, half the data is in the row and half is in the column, like so:

<pre class="prettyprint"><code>M = [1 2 3 4
     0 5 6 7
     0 0 8 9
     0 0 0 10]
</code></pre>

So, given the above matrix, I want to be able to do a query like <code>M[1]</code> and get back <code>[2,5,6,7]</code>. I have two questions:

1) Is there a more efficient (preferably built-in) way to do this than first querying the row, then the column, and then concatenating the two? This is bad because whether I use the CSC (column-based) or CSR (row-based) internal representation, one of the two queries is highly inefficient.

2) Am I even using the right part of SciPy? I have seen a few functions in the SciPy library that mention triangular matrices, but they seem to revolve around getting triangular matrices from a full matrix. In my case, (I think) I already have a triangular matrix and want to manipulate it.

Many thanks.

I would say that you can't have your cake and eat it too: if you want efficient storage, you cannot store full rows (as you say); if you want efficient row access, I'd say that you have to store full rows.

While actual performance depends on your application, you could check whether the following approach works for you: <ol> <li>Use SciPy's sparse matrices for efficient storage.</li> <li>Automatically symmetrize your matrix (there is a small recipe on Stack Overflow that works at least on regular matrices).</li> <li>Access its rows (or columns); whether this is efficient depends on the implementation of sparse matrices…</li> </ol>
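The steps above could be sketched roughly as follows. This is only an illustration, not the recipe the answer refers to: the helper name `symmetrize` is made up here, and the 4x4 matrix is the small example from the question. It assumes the upper triangle (including the diagonal) is what is stored, and it accepts that every off-diagonal entry ends up stored twice in exchange for cheap row access in CSR format.

```python
import numpy as np
from scipy import sparse


def symmetrize(upper):
    """Build a full symmetric CSR matrix from upper-triangular storage.

    Hypothetical helper: assumes `upper` holds values only where i <= j.
    """
    upper = sparse.csr_matrix(upper)
    # Entries strictly above the diagonal; transposing them fills in the
    # lower triangle without counting the diagonal twice.
    strictly_upper = sparse.triu(upper, k=1, format="csr")
    return (upper + strictly_upper.T).tocsr()


# The example matrix from the question, stored as an upper triangle.
upper = sparse.lil_matrix(np.array([
    [1, 2, 3, 4],
    [0, 5, 6, 7],
    [0, 0, 8, 9],
    [0, 0, 0, 10],
]))

full = symmetrize(upper)

# Row access on the CSR result now returns the complete symmetric row.
row = full[1].toarray().ravel()
print(row)  # [2 5 6 7]
```

Whether doubling the stored off-diagonal entries is affordable for a 250k x 250k matrix depends on how many nonzeros it has, so it may be worth checking `upper.nnz` before converting.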

Scipy sparse triangular matrix?

Tags: python, matrix, scipy


asked Jun 24 '10 03:06

gilesc

1 Answer


answered Oct 26 '22 00:10

Eric O Lebigot
