I have a fairly large sparse matrix that, I reckon, would occupy about 1 GB when loaded into memory.
I don't need access to the whole matrix at all times, so some kind of memory mapping would work; it doesn't seem to be possible, however, to memory-map a sparse matrix using numpy or scipy (the tools I'm familiar with).
It can easily fit into memory, but it would be a pain if I had to load it every time I run the program. Maybe some way to keep it in memory between runs?
So, what do you suggest: 1. find a way to memory-map a sparse matrix; 2. just load the whole thing into memory every time; 3. something else?
The following may work as a general concept, but you are going to have to figure out a lot of details... You should first make yourself familiar with the CSR format, where all the information for a matrix is stored in three arrays: two of length equal to the number of non-zero entries, and one of length the number of rows plus one:
>>> import scipy.sparse as sps
>>> a = sps.rand(5, 5, density=0.2, format='csr')
>>> a.toarray()
array([[ 0. , 0.46531486, 0.03849468, 0.51743202, 0. ],
[ 0. , 0.67028033, 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0.9967058 ],
[ 0. , 0. , 0. , 0. , 0. ]])
>>> a.data
array([ 0.46531486, 0.03849468, 0.51743202, 0.67028033, 0.9967058 ])
>>> a.indices
array([1, 2, 3, 1, 4])
>>> a.indptr
array([0, 3, 4, 4, 5, 5])
So a.data has the non-zero entries in row-major order, a.indices has the corresponding column indices of the non-zero entries, and a.indptr has the starting indices into the other two arrays where the data for each row starts. E.g. a.indptr[3] = 4 and a.indptr[3+1] = 5, so the non-zero entries of the fourth row are a.data[4:5], and their column indices a.indices[4:5].
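You can check this against the example above:
>>> a.data[a.indptr[3]:a.indptr[4]]
array([ 0.9967058])
>>> a.indices[a.indptr[3]:a.indptr[4]]
array([4])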
So you could store these three arrays on disk, access them as memmaps, and then retrieve rows m:n (i.e. rows m through n-1, as with Python slicing) as follows:
ip = indptr[m:n+1].copy()  # row pointers for rows m:n; copy, since we modify them below
d = data[ip[0]:ip[-1]]     # non-zero values of those rows
i = indices[ip[0]:ip[-1]]  # their column indices
ip -= ip[0]                # re-base the row pointers to start at zero
rows = sps.csr_matrix((d, i, ip))
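To get those memmaps in the first place, one option (only a sketch; the file names are placeholders) is to save the three arrays as .npy files and reopen them with numpy's mmap_mode:
import numpy as np
import scipy.sparse as sps

a = sps.rand(1000, 10, density=0.5, format='csr')  # stand-in for your real matrix

# Save the three CSR arrays as .npy files:
np.save('data.npy', a.data)
np.save('indices.npy', a.indices)
np.save('indptr.npy', a.indptr)

# In a later run, open them as read-only memory maps:
data = np.load('data.npy', mmap_mode='r')
indices = np.load('indices.npy', mmap_mode='r')
indptr = np.load('indptr.npy', mmap_mode='r')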
As a general proof of concept:
>>> c = sps.rand(1000, 10, density=0.5, format='csr')
>>> ip = c.indptr[20:25+1].copy()
>>> d = c.data[ip[0]:ip[-1]]
>>> i = c.indices[ip[0]:ip[-1]]
>>> ip -= ip[0]
>>> rows = sps.csr_matrix((d, i, ip))
>>> rows.toarray()
array([[ 0. , 0. , 0. , 0. , 0.55683501,
0.61426248, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.67789204, 0. , 0.71821363,
0.01409666, 0. , 0. , 0.58965142, 0. ],
[ 0. , 0. , 0. , 0.1575835 , 0.08172986,
0.41741147, 0.72044269, 0. , 0.72148343, 0. ],
[ 0. , 0.73040998, 0.81507086, 0.13405909, 0. ,
0. , 0.82930945, 0.71799358, 0.8813616 , 0.51874795],
[ 0.43353831, 0.00658204, 0. , 0. , 0. ,
0.10863725, 0. , 0. , 0. , 0.57231074]])
>>> c[20:25].toarray()
array([[ 0. , 0. , 0. , 0. , 0.55683501,
0.61426248, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.67789204, 0. , 0.71821363,
0.01409666, 0. , 0. , 0.58965142, 0. ],
[ 0. , 0. , 0. , 0.1575835 , 0.08172986,
0.41741147, 0.72044269, 0. , 0.72148343, 0. ],
[ 0. , 0.73040998, 0.81507086, 0.13405909, 0. ,
0. , 0.82930945, 0.71799358, 0.8813616 , 0.51874795],
[ 0.43353831, 0.00658204, 0. , 0. , 0. ,
0.10863725, 0. , 0. , 0. , 0.57231074]])
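Wrapped up as a (hypothetical) helper function, with an explicit shape so that trailing all-zero columns are not dropped, the retrieval could look like:
import scipy.sparse as sps

def get_rows(data, indices, indptr, m, n, ncols):
    # Rows m:n of a (possibly memmapped) CSR matrix, as an in-memory csr_matrix.
    ip = indptr[m:n+1].copy()
    d = data[ip[0]:ip[-1]]
    i = indices[ip[0]:ip[-1]]
    ip -= ip[0]
    return sps.csr_matrix((d, i, ip), shape=(n - m, ncols))

rows = get_rows(data, indices, indptr, 20, 25, 10)  # rows 20-24 of the memmapped example above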
Scipy supports several different kinds of sparse matrices, but you'd have to write a routine to read yours into memory. Which type you should use depends on what you want to do with the matrix.
If your matrix is very sparse, you could save (row, column, value) tuples to disk as binary data using the struct module. That would make the on-disk data smaller and easier to load, assuming portability is not an issue.
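Writing the data could look something like this (just a sketch; the file name sparse.dat matches the reader below, and the COO-format stand-in matrix is an assumption):
import struct

import scipy.sparse as sps

a = sps.rand(1000, 1000, density=0.01, format='coo')  # stand-in for your matrix

fmt = 'IId'  # two unsigned ints (row, column) and a double (value)
with open('sparse.dat', 'wb') as outfile:
    for row, col, value in zip(a.row, a.col, a.data):
        outfile.write(struct.pack(fmt, row, col, value))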
You could then read the data like this:
import struct
from functools import partial

fmt = 'IId'
size = struct.calcsize(fmt)
with open('sparse.dat', 'rb') as infile:
    f = partial(infile.read, size)  # read one record at a time
    for chunk in iter(f, b''):  # b'' sentinel: read() returns bytes on a binary file
        row, col, value = struct.unpack(fmt, chunk)
        # put it in your matrix here
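For example, you could collect the triples and build a COO matrix, then convert to CSR for fast row access (again a sketch; the shape is a placeholder for your real dimensions):
import struct
from functools import partial

import scipy.sparse as sps

fmt = 'IId'
size = struct.calcsize(fmt)
rows, cols, values = [], [], []
with open('sparse.dat', 'rb') as infile:
    f = partial(infile.read, size)
    for chunk in iter(f, b''):
        row, col, value = struct.unpack(fmt, chunk)
        rows.append(row)
        cols.append(col)
        values.append(value)
matrix = sps.coo_matrix((values, (rows, cols)), shape=(1000, 1000)).tocsr()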