I'm trying to implement a function in NumPy/Scipy to compute Jensen-Shannon divergence between a single (training) vector and a large number of other (observation) vectors. The observation vectors are stored in a very large (500,000x65536) Scipy sparse matrix (a dense matrix won't fit into memory).
As part of the algorithm I need to compute T+Oi for each observation vector Oi, where T is the training vector. I wasn't able to find a way to do this using NumPy's usual broadcasting rules, since sparse matrices don't seem to support those (if T is left as a dense array, Scipy tries to make the sparse matrix dense first, which runs out of memory; if I make T into a sparse matrix, T+Oi fails because the shapes are inconsistent).
Currently I am taking the grossly inefficient step of tiling the training vector into a 500,000x65536 sparse matrix:
import numpy as np
import scipy.sparse as sp

training = sp.csr_matrix(training.astype(np.float32))
# Repeat the vector's data and indices once per observation row, with an
# indptr that advances by the vector's nnz for every row.
tindptr = np.arange(0, len(training.indices) * observations.shape[0] + 1,
                    len(training.indices), dtype=np.int32)
tindices = np.tile(training.indices, observations.shape[0])
tdata = np.tile(training.data, observations.shape[0])
mtraining = sp.csr_matrix((tdata, tindices, tindptr), shape=observations.shape)
But this takes up a huge amount of memory (around 6GB: ~1500 nonzeros × 500,000 rows, at 4 bytes each for the float32 data and the int32 indices), when it's really only storing ~1500 distinct elements. It's also pretty slow to construct.
I tried to get clever by using stride_tricks so that the CSR matrix's data and indices members don't use extra memory for the repeated data.
training = sp.csr_matrix(training)
mtraining = sp.csr_matrix(observations.shape, dtype=np.int32)
tdata = training.data
# A zero stride along the first axis makes every "row" of the view alias
# the same underlying memory.
vdata = np.lib.stride_tricks.as_strided(tdata, (mtraining.shape[0], tdata.size),
                                        (0, tdata.itemsize))
indices = training.indices
vindices = np.lib.stride_tricks.as_strided(indices, (mtraining.shape[0], indices.size),
                                           (0, indices.itemsize))
mtraining.indptr = np.arange(0, len(indices) * mtraining.shape[0] + 1,
                             len(indices), dtype=np.int32)
mtraining.data = vdata
mtraining.indices = vindices
But this doesn't work, because the strided views mtraining.data and mtraining.indices are the wrong shape (two-dimensional, where CSR expects flat arrays), and according to this answer there's no way to reshape them without copying. Trying to make them look flat using the .flat iterator fails because it doesn't look enough like an array (e.g. it doesn't have a dtype member), and using the flatten() method ends up making a copy.
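For instance, here's a toy demonstration (with made-up shapes) of why flatten() defeats the purpose:

t = np.arange(3)
v = np.lib.stride_tricks.as_strided(t, (4, 3), (0, t.itemsize))
flat = v.flatten()  # materializes all 4*3 elements in fresh memory
flat[0] = 99
print(t[0])         # still 0: flatten() returned a copy, not a view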
Is there any way to get this done?
Your other option, which I hadn't even considered, is to implement the sum in sparse format yourself, so that you can take full advantage of the periodic nature of your array. This can be very easy to do, if you abuse this peculiar behavior of scipy's sparse matrices:
>>> import numpy as np
>>> import scipy.sparse as sps
>>> a = sps.csr_matrix([1,2,3,4])
>>> a.data
array([1, 2, 3, 4])
>>> a.indices
array([0, 1, 2, 3])
>>> a.indptr
array([0, 4])
>>> b = sps.csr_matrix((np.array([1, 2, 3, 4, 5]),
... np.array([0, 1, 2, 3, 0]),
... np.array([0, 5])), shape=(1, 4))
>>> b
<1x4 sparse matrix of type '<type 'numpy.int32'>'
with 5 stored elements in Compressed Sparse Row format>
>>> b.todense()
matrix([[6, 2, 3, 4]])
So you don't even have to look for coincidences between your training vector and each of the rows of the observation matrix to add them up: just cram all the data, with the right column indices and row pointers, in there, and whatever needs summing will get summed when the data is accessed.
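For instance, here's the same trick applied to adding a vector to every row of a tiny 2x4 matrix (toy values of my own, not from the question): each output row simply holds the row's original entries followed by a copy of the vector's entries, duplicates and all.

>>> m = sps.csr_matrix([[1, 0, 2, 0], [0, 3, 0, 4]])
>>> vdata, vcols = np.array([10, 40]), np.array([0, 3])  # the vector [10, 0, 0, 40]
>>> data = np.concatenate([m.data[:2], vdata, m.data[2:], vdata])
>>> cols = np.concatenate([m.indices[:2], vcols, m.indices[2:], vcols])
>>> indptr = np.array([0, 4, 8])  # each row now stores 2 + 2 entries
>>> sps.csr_matrix((data, cols, indptr), shape=(2, 4)).todense()
matrix([[11,  0,  2, 40],
        [10,  3,  0, 44]])

The duplicate entries (column 0 of row 0 holds both 1 and 10; column 3 of row 1 holds both 4 and 40) are summed automatically.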
EDIT
Given the slowness of my first attempt (the loop-based code in the second EDIT below), you can trade memory for speed as follows:
def csr_add_sparse_vec(sps_mat, sps_vec):
    """Adds a sparse vector to every row of a sparse matrix"""
    # No checks done, but both arguments should be sparse matrices in CSR
    # format, both should have the same number of columns, and the vector
    # should be a vector and have only one row.
    rows, cols = sps_mat.shape
    nnz_vec = len(sps_vec.data)
    nnz_per_row = np.diff(sps_mat.indptr)
    longest_row = np.max(nnz_per_row)
    # Pad every row out to longest_row entries with explicit zeros, so the
    # data and indices can be handled as dense (rows, longest_row) arrays.
    old_data = np.zeros((rows * longest_row,), dtype=sps_mat.data.dtype)
    old_cols = np.zeros((rows * longest_row,), dtype=sps_mat.indices.dtype)
    data_idx = np.arange(longest_row) < nnz_per_row[:, None]
    data_idx = data_idx.reshape(-1)
    old_data[data_idx] = sps_mat.data
    old_cols[data_idx] = sps_mat.indices
    old_data = old_data.reshape(rows, -1)
    old_cols = old_cols.reshape(rows, -1)
    # Append the vector's nnz_vec entries to every padded row.
    new_data = np.zeros((rows, longest_row + nnz_vec,),
                        dtype=sps_mat.data.dtype)
    new_data[:, :longest_row] = old_data
    del old_data
    new_cols = np.zeros((rows, longest_row + nnz_vec,),
                        dtype=sps_mat.indices.dtype)
    new_cols[:, :longest_row] = old_cols
    del old_cols
    new_data[:, longest_row:] = sps_vec.data
    new_cols[:, longest_row:] = sps_vec.indices
    new_data = new_data.reshape(-1)
    new_cols = new_cols.reshape(-1)
    new_pointer = np.arange(0, (rows + 1) * (longest_row + nnz_vec),
                            longest_row + nnz_vec)
    ret = sps.csr_matrix((new_data, new_cols, new_pointer),
                         shape=sps_mat.shape)
    # Drop the explicit zeros introduced by the padding; duplicate (row, col)
    # entries get summed whenever the matrix is used.
    ret.eliminate_zeros()
    return ret
It is not as fast as the plain cramming above, but it can do 10,000 rows in about 1 s:
In [2]: a
Out[2]:
<10000x65536 sparse matrix of type '<type 'numpy.float64'>'
with 15000000 stored elements in Compressed Sparse Row format>
In [3]: b
Out[3]:
<1x65536 sparse matrix of type '<type 'numpy.float64'>'
with 1500 stored elements in Compressed Sparse Row format>
In [4]: csr_add_sparse_vec(a, b)
Out[4]:
<10000x65536 sparse matrix of type '<type 'numpy.float64'>'
with 30000000 stored elements in Compressed Sparse Row format>
In [5]: %timeit csr_add_sparse_vec(a, b)
1 loops, best of 3: 956 ms per loop
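As a quick sanity check (my own sketch, using hypothetical random test matrices from scipy.sparse.rand), the result agrees with dense broadcasting on a small case:

a = sps.rand(100, 50, density=0.1, format='csr')
b = sps.rand(1, 50, density=0.2, format='csr')
c = csr_add_sparse_vec(a, b)
# Dense broadcasting gives the reference answer at this small size.
assert np.allclose(c.toarray(), a.toarray() + b.toarray())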
EDIT This was my first attempt, and it is very, very slow: each np.insert call reallocates and copies the full data and index arrays, once per nonzero of the vector.
import itertools

def csr_add_sparse_vec(sps_mat, sps_vec):
    """Adds a sparse vector to every row of a sparse matrix"""
    # No checks done, but both arguments should be sparse matrices in CSR
    # format, both should have the same number of columns, and the vector
    # should be a vector and have only one row.
    rows, cols = sps_mat.shape
    new_data = sps_mat.data
    new_pointer = sps_mat.indptr.copy()
    new_cols = sps_mat.indices
    aux_idx = np.arange(rows + 1)
    # One pass per nonzero of the vector, appending that entry to every row.
    # (itertools.izip is Python 2; use plain zip on Python 3.)
    for value, col in itertools.izip(sps_vec.data, sps_vec.indices):
        new_data = np.insert(new_data, new_pointer[1:], [value] * rows)
        new_cols = np.insert(new_cols, new_pointer[1:], [col] * rows)
        new_pointer += aux_idx
    return sps.csr_matrix((new_data, new_cols, new_pointer),
                          shape=sps_mat.shape)