I'm new to python, coming from matlab. I have a large sparse matrix saved in matlab v7.3 (HDF5) format. I've so far found two ways of loading in the file, using h5py and tables. However, operating on the matrix seems to be extremely slow after either. For example, in matlab:
>> whos
  Name      Size               Bytes      Class     Attributes
  M         11337x133338       77124408   double    sparse
>> tic, sum(M(:)); toc
Elapsed time is 0.086233 seconds.
Using tables:
t = time.time()
sum(f.root.M.data)
elapsed = time.time() - t
print elapsed
35.929461956
Using h5py:
t = time.time()
sum(f["M"]["data"])
elapsed = time.time() - t
print elapsed
(I gave up waiting ...)
[EDIT]
Based on the comments from @bpgergo, I should add that I've tried converting the result loaded in by h5py (f) into a numpy array or a scipy sparse array in the following two ways:
from scipy import sparse
A = sparse.csc_matrix((f["M"]["data"], f["M"]["ir"], f["M"]["jc"]))
or
data = numpy.asarray(f["M"]["data"])
ir = numpy.asarray(f["M"]["ir"])
jc = numpy.asarray(f["M"]["jc"])
A = sparse.coo_matrix(data, (ir, jc))
but both of these operations are extremely slow as well.
Is there something I'm missing here?
Most of your problem is that you're using python's builtin sum on what's effectively a memory-mapped array (i.e. it's on disk, not in memory).
First off, you're comparing the time it takes to read things from disk to the time it takes to read things in memory. Load the array into memory first, if you want to compare to what you're doing in matlab.
Secondly, python's builtin sum is very inefficient for numpy arrays. (Or, rather, iterating through every item of a numpy array independently is very slow, which is what python's builtin sum is doing.) Use numpy.sum(yourarray) or yourarray.sum() instead for numpy arrays.
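To make the second point concrete, here is a minimal sketch that times both approaches on an array that is already in memory. The array size and the helper name `timed` are arbitrary choices for illustration, and the exact timings will vary by machine.

```python
import time

import numpy as np


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Illustrative data only: one million random doubles held in memory.
values = np.random.rand(10**6)

# The builtin sum iterates over the array one element at a time in Python.
builtin_total, builtin_secs = timed(sum, values)
# numpy.sum does the whole reduction in a single vectorized call.
numpy_total, numpy_secs = timed(np.sum, values)

print("builtin sum: %.4f s" % builtin_secs)
print("numpy sum:   %.4f s" % numpy_secs)
```

Both calls should return the same total up to floating-point rounding, so only the elapsed time differs.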
As an example (using h5py, because I'm more familiar with it):
import h5py
import numpy as np
f = h5py.File('yourfile.hdf', 'r')
dataset = f['/M/data']
# Load the entire array into memory, like you're doing for matlab...
data = np.empty(dataset.shape, dataset.dtype)
dataset.read_direct(data)
print(data.sum())  # Or alternately, np.sum(data)
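A shorter way to get the same in-memory copy is to index the dataset with `[...]`, which h5py reads into a numpy array in one call. The sketch below is self-contained, so it writes a small throwaway file first. The file name `demo.h5`, the helper `read_whole_dataset`, and the ten-element array are made up for illustration and are not from the question's data.

```python
import os
import tempfile

import h5py
import numpy as np


def read_whole_dataset(path, name):
    """Return the full contents of an HDF5 dataset as an in-memory numpy array."""
    with h5py.File(path, "r") as f:
        # Indexing with [...] reads the entire dataset from disk at once.
        return f[name][...]


# Build a small throwaway file so the sketch can run on its own.
tmpdir = tempfile.mkdtemp()
demo_path = os.path.join(tmpdir, "demo.h5")
with h5py.File(demo_path, "w") as f:
    # The intermediate group "M" is created automatically.
    f.create_dataset("M/data", data=np.arange(10, dtype=np.float64))

data = read_whole_dataset(demo_path, "M/data")
print(type(data).__name__, data.sum())
```

Once `data` is a plain numpy array, `data.sum()` no longer touches the file, which is the fair comparison against the matlab timing.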
The final answer for posterity:
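import tables, warnings
from scipy import sparse

def load_sparse_matrix(fname):
    # Suppress UserWarning messages (this filter stays active after the call)
    warnings.simplefilter("ignore", UserWarning)
    f = tables.open_file(fname)  # named openFile in PyTables 2.x
    # Indexing with [...] reads each dataset fully into memory as a numpy array
    M = sparse.csc_matrix((f.root.M.data[...], f.root.M.ir[...], f.root.M.jc[...]))
    f.close()
    return M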
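The reason csc_matrix is the right constructor here is that matlab stores sparse matrices in compressed-column form: data holds the nonzero values column by column, ir holds the row index of each value, and jc holds column pointers. These line up with the (data, indices, indptr) form of scipy's csc_matrix. Since jc holds pointers rather than one column index per value, it cannot be passed to coo_matrix as column coordinates. The tiny 3x3 matrix below is made up purely to illustrate the layout.

```python
import numpy as np
from scipy import sparse

# Dense reference for this illustrative matrix:
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 0, 0]]
data = np.array([1.0, 4.0, 2.0, 3.0])  # nonzero values, column by column
ir = np.array([0, 2, 0, 1])            # row index of each value in data
jc = np.array([0, 2, 2, 4])            # column j occupies data[jc[j]:jc[j + 1]]

# The shape is given explicitly here; if omitted, scipy infers it from the indices.
A = sparse.csc_matrix((data, ir, jc), shape=(3, 3))

print(A.toarray())
print(A.sum())
```

If the shape is left out, as in the function above, the row count is inferred from the largest row index, so trailing all-zero rows would be dropped. Pass shape explicitly if you know the true dimensions.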