To store a big matrix on disk I use numpy.memmap.
Here is some sample code to test big matrix multiplication:
import numpy as np
import time
rows = 10000  # this can be much larger, e.g. 1,000,000
cols = 1000
# create some data in memory
data = np.arange(rows * cols, dtype='float32')
data.resize((rows, cols))
# create the files on disk
fp0 = np.memmap('C:/data_0', dtype='float32', mode='w+', shape=(rows, cols))
fp1 = np.memmap('C:/data_1', dtype='float32', mode='w+', shape=(rows, cols))
fp0[:] = data[:]
fp1[:] = data[:]
# matrix transpose test
tr = np.memmap('C:/data_tr', dtype='float32', mode='w+', shape=(cols, rows))
tr = np.transpose(fp1)  # memory consumption?
print(fp1.shape)
print(tr.shape)
res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows,rows))
t0 = time.time()
# redefinition? res = np.dot(fp0, tr)  # takes 342 seconds on my machine; multiplying the matrices in RAM takes 345 seconds (I think that is a strange result)
res[:] = np.dot(fp0, tr)  # assignment?
print(res.shape)
print(time.time() - t0)
So my questions are the ones marked in the comments above: the memory consumption of the transpose, redefinition versus assignment into the memmap, and why the on-disk and in-RAM timings are almost the same.
I am also interested in algorithms for solving linear systems of equations (SVD and others) with restricted memory usage. These algorithms are maybe called out-of-core or iterative, and I think there is an analogy like hard drive <-> RAM, GPU RAM <-> CPU RAM, CPU RAM <-> CPU cache.
I also found some info here about matrix multiplication in PyTables.
I also found this in R, but I need it for Python or Matlab.
Dask.array provides a numpy interface to large on-disk arrays using blocked algorithms and task scheduling. It can easily do out-of-core matrix multiplies and other simple-ish numpy operations.
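For example, a minimal sketch of the multiplication from the question, assuming the two memmap files created above and an illustrative chunk size of 1000x1000:

import numpy as np
import dask.array as da

rows, cols = 10000, 1000

# open the existing on-disk arrays read-only and wrap them as chunked dask arrays
fp0 = np.memmap('C:/data_0', dtype='float32', mode='r', shape=(rows, cols))
fp1 = np.memmap('C:/data_1', dtype='float32', mode='r', shape=(rows, cols))
d0 = da.from_array(fp0, chunks=(1000, 1000))
d1 = da.from_array(fp1, chunks=(1000, 1000))

# build the lazy, blocked task graph for fp0 . fp1^T; nothing is computed yet
prod = d0.dot(d1.T)

# stream the result block by block into an on-disk output array
res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))
da.store(prod, res)

Nothing is read until da.store runs; it then computes and writes the product one block at a time, so the full (rows x rows) result never has to fit in RAM.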
Blocked linear algebra is harder and you might want to check out some of the academic work on this topic. Dask does support QR and SVD factorizations on tall-and-skinny matrices.
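As a rough sketch of those factorizations (the tall-and-skinny algorithms need the array chunked along the rows only, so the whole column dimension sits in one chunk; the file name and chunk size below are just illustrative):

import numpy as np
import dask.array as da

# tall-and-skinny: many rows, few columns, chunked along the rows only
fp0 = np.memmap('C:/data_0', dtype='float32', mode='r', shape=(10000, 1000))
x = da.from_array(fp0, chunks=(1000, 1000))

q, r = da.linalg.qr(x)       # direct tall-and-skinny QR
u, s, v = da.linalg.svd(x)   # SVD of the same tall-and-skinny matrix

print(s.compute()[:5])       # first few singular values, computed block by block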
Regardless, for large arrays you really want blocked algorithms, not naive traversals, which will hit disk in unpleasant ways.
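To illustrate what blocking buys you with plain numpy and the memmap files from the question (a hypothetical sketch, not what Dask does internally; the block size is arbitrary, and only one block of rows of the first operand plus the second operand needs to be in RAM at a time):

import numpy as np

rows, cols, block = 10000, 1000, 2000

a = np.memmap('C:/data_0', dtype='float32', mode='r', shape=(rows, cols))
b = np.memmap('C:/data_1', dtype='float32', mode='r', shape=(rows, cols))
out = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))

# compute a . b^T one block of rows at a time instead of materializing
# the full (rows x rows) product in memory
bt = np.ascontiguousarray(b.T)          # ~40 MB here; kept in RAM for speed
for i in range(0, rows, block):
    out[i:i + block] = np.dot(a[i:i + block], bt)
out.flush()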