
Store and retrieve large sparse matrix [closed]

I have a fairly large sparse matrix that, I reckon, would occupy about 1 GB when loaded into memory.

I don't need access to the whole matrix at all times, so some kind of memory mapping would work; it doesn't seem to be possible, however, to memory map a sparse matrix using numpy or scipy (the tools I'm familiar with).

It can easily fit into memory, but it would be a pain if I had to load it every time I run the program. Maybe some way to keep it in memory between runs?

So, what do you suggest: 1. Find a way to memory map a sparse matrix; 2. Just load the whole thing into memory every time; 3. Something else?

asked Apr 16 '13 by user1491915

2 Answers

The following may work as a general concept, but you are going to have to figure out a lot of details. You should first make yourself familiar with the CSR format, in which all the information for a matrix is stored in three arrays: two of length equal to the number of non-zero entries, and one of length the number of rows plus one:

>>> import scipy.sparse as sps
>>> a = sps.rand(5, 5, density=0.2, format='csr')
>>> a.toarray()
array([[ 0.        ,  0.46531486,  0.03849468,  0.51743202,  0.        ],
       [ 0.        ,  0.67028033,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.9967058 ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])
>>> a.data
array([ 0.46531486,  0.03849468,  0.51743202,  0.67028033,  0.9967058 ])
>>> a.indices
array([1, 2, 3, 1, 4])
>>> a.indptr
array([0, 3, 4, 4, 5, 5])

So a.data holds the non-zero entries in row-major order, a.indices holds the corresponding column indices of the non-zero entries, and a.indptr holds, for every row, the starting index into the other two arrays where that row's data begins. For example, a.indptr[3] = 4 and a.indptr[3+1] = 5, so the non-zero entries of the fourth row are a.data[4:5], and their column indices are a.indices[4:5].
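To make that indexing concrete, here is a small sketch that rebuilds a dense row from the three CSR arrays (the helper name csr_row is mine, not part of scipy):

```python
import numpy as np
import scipy.sparse as sps

def csr_row(a, r):
    """Reconstruct dense row r of a CSR matrix from data/indices/indptr."""
    start, end = a.indptr[r], a.indptr[r + 1]
    row = np.zeros(a.shape[1])
    row[a.indices[start:end]] = a.data[start:end]
    return row

# Small reproducible example to exercise the helper.
a = sps.random(5, 5, density=0.2, format='csr', random_state=0)
```

Every row rebuilt this way should match the corresponding row of a.toarray().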

So you could store these three arrays on disk, access them as memmaps, and then retrieve rows m through n as follows:

ip = indptr[m:n+1].copy()   # row pointers for the slice
d = data[ip[0]:ip[-1]]      # non-zero values of rows m..n-1
i = indices[ip[0]:ip[-1]]   # their column indices
ip -= ip[0]                 # re-base the pointers at zero
rows = sps.csr_matrix((d, i, ip))
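One way to get the memmapped arrays is to save the three components as .npy files and reopen them with np.load(..., mmap_mode='r'), so only the slices you touch are actually read from disk. A sketch, with illustrative file names and helper names of my own choosing:

```python
import numpy as np
import scipy.sparse as sps

def save_csr(a, prefix):
    # Write the three CSR arrays to separate .npy files.
    np.save(prefix + '_data.npy', a.data)
    np.save(prefix + '_indices.npy', a.indices)
    np.save(prefix + '_indptr.npy', a.indptr)

def load_rows(prefix, m, n, ncols):
    # Memory-map the arrays; only the requested slices hit the disk.
    data = np.load(prefix + '_data.npy', mmap_mode='r')
    indices = np.load(prefix + '_indices.npy', mmap_mode='r')
    indptr = np.load(prefix + '_indptr.npy', mmap_mode='r')
    ip = indptr[m:n + 1].copy()  # row pointers for the slice
    d = data[ip[0]:ip[-1]]
    i = indices[ip[0]:ip[-1]]
    ip -= ip[0]                  # re-base the pointers at zero
    return sps.csr_matrix((d, i, ip), shape=(n - m, ncols))
```

The explicit shape argument avoids ambiguity when the last rows of the slice are empty.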

As a general proof of concept:

>>> c = sps.rand(1000, 10, density=0.5, format='csr')
>>> ip = c.indptr[20:25+1].copy()
>>> d = c.data[ip[0]:ip[-1]]
>>> i = c.indices[ip[0]:ip[-1]]
>>> ip -= ip[0]
>>> rows = sps.csr_matrix((d, i, ip))
>>> rows.toarray()
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.55683501,
         0.61426248,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.67789204,  0.        ,  0.71821363,
         0.01409666,  0.        ,  0.        ,  0.58965142,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.1575835 ,  0.08172986,
         0.41741147,  0.72044269,  0.        ,  0.72148343,  0.        ],
       [ 0.        ,  0.73040998,  0.81507086,  0.13405909,  0.        ,
         0.        ,  0.82930945,  0.71799358,  0.8813616 ,  0.51874795],
       [ 0.43353831,  0.00658204,  0.        ,  0.        ,  0.        ,
         0.10863725,  0.        ,  0.        ,  0.        ,  0.57231074]])
>>> c[20:25].toarray()
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.55683501,
         0.61426248,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.67789204,  0.        ,  0.71821363,
         0.01409666,  0.        ,  0.        ,  0.58965142,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.1575835 ,  0.08172986,
         0.41741147,  0.72044269,  0.        ,  0.72148343,  0.        ],
       [ 0.        ,  0.73040998,  0.81507086,  0.13405909,  0.        ,
         0.        ,  0.82930945,  0.71799358,  0.8813616 ,  0.51874795],
       [ 0.43353831,  0.00658204,  0.        ,  0.        ,  0.        ,
         0.10863725,  0.        ,  0.        ,  0.        ,  0.57231074]])
answered Oct 29 '22 by Jaime

Scipy supports several kinds of sparse matrices, but you'd have to write a routine to read yours into memory. Which type to use depends on what you want to do with the matrix.

If your matrix is very sparse, you could save (row, column, value) tuples to disk as binary data using the struct module. That would make the on-disk data smaller and make it easier to load, assuming portability is not an issue.

You could then read the data like this:

import struct
from functools import partial

fmt = 'IId'                    # row (uint32), col (uint32), value (float64)
size = struct.calcsize(fmt)

with open('sparse.dat', 'rb') as infile:
    f = partial(infile.read, size)
    for chunk in iter(f, b''):  # read() returns b'' at EOF
        row, col, value = struct.unpack(fmt, chunk)
        # put it in your matrix here
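Putting the write and read sides together, the triples can be collected and handed to scipy.sparse.coo_matrix, which is built exactly from (value, (row, col)) data. This is a sketch; the file layout '=IId' (standard sizes, no padding) and the helper names are my own assumptions:

```python
import struct
import scipy.sparse as sps

FMT = '=IId'            # row (uint32), col (uint32), value (float64)
SIZE = struct.calcsize(FMT)

def write_triples(path, triples):
    # Serialize (row, col, value) triples as fixed-size binary records.
    with open(path, 'wb') as f:
        for row, col, value in triples:
            f.write(struct.pack(FMT, row, col, value))

def read_coo(path, shape):
    # Read the records back and assemble a COO sparse matrix.
    rows, cols, vals = [], [], []
    with open(path, 'rb') as f:
        while chunk := f.read(SIZE):
            r, c, v = struct.unpack(FMT, chunk)
            rows.append(r)
            cols.append(c)
            vals.append(v)
    return sps.coo_matrix((vals, (rows, cols)), shape=shape)
```

A COO matrix built this way can then be converted with .tocsr() or .tocsc() depending on the access pattern you need.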
answered Oct 29 '22 by Roland Smith