 

Python Big Data Matrix manipulation

I think I have a big-data scenario (N = 1e6 and dimension = 3). I need to do some matrix manipulations, such as einsum, matrix inversion, etc., several times in my code. To give an idea, I want to do something like the following.

import numpy as np
import numpy.random as rd

ndata, kdata = int(1e6), int(1e5)   # 1e6 and 1e5 are floats; array shapes need ints

x = rd.normal(0, 1, (ndata, kdata, 3, 3))

y = rd.normal(0, 1, (ndata, kdata, 3, 3))

For small ndata and kdata, the following would be an efficient and convenient approach:

xy = np.einsum('pqrs, pqsu -> pqru', x, y)
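To see the scale of the problem (this arithmetic is mine, not from the original post): at the stated sizes, each of these arrays alone is far larger than any machine's RAM.

```python
# Back-of-the-envelope memory estimate: each array holds
# ndata * kdata * 3 * 3 float64 values at 8 bytes apiece.
ndata, kdata = int(1e6), int(1e5)
n_bytes = ndata * kdata * 3 * 3 * 8
terabytes = n_bytes / 1e12
print(terabytes)  # → 7.2 (terabytes per array)
```

So the full arrays cannot even be allocated, let alone multiplied in one einsum call.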

Since I have large ndata and kdata, the above approach becomes memory bound, so the next bet would be the dot product with a nested for loop over ndata and kdata, as follows:

xyloop1 = np.empty((ndata, kdata, 3, 3))

for j in range(ndata):
    for k in range(kdata):
        xyloop1[j, k] = np.dot(x[j, k], y[j, k])
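As a side note not raised in the question: `np.matmul` (the `@` operator) already broadcasts matrix multiplication over all leading axes, so within any chunk that fits in memory the nested loop can be replaced outright. A small runnable sketch with toy sizes:

```python
import numpy as np

# Toy sizes so the example runs quickly; @ broadcasts the 3x3 matrix
# product over the (ndata, kdata) leading axes.
ndata, kdata = 4, 5
x = np.random.normal(0, 1, (ndata, kdata, 3, 3))
y = np.random.normal(0, 1, (ndata, kdata, 3, 3))

xy = x @ y  # shape (ndata, kdata, 3, 3)

# Same result as the explicit nested-loop dot from the question
xyloop1 = np.empty((ndata, kdata, 3, 3))
for j in range(ndata):
    for k in range(kdata):
        xyloop1[j, k] = np.dot(x[j, k], y[j, k])
assert np.allclose(xy, xyloop1)
```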

Given what I have been taught, for loops are slow in Python. Also, I want to use the benefits of NumPy, so I thought a block-matrix approach would be preferable, something like the following:

nstep = 200
ndiv  = ndata // nstep   # integer division; '/' gives a float in Python 3

kstep = 200
kdiv  = kdata // kstep

xyloop2 = np.empty((ndata, kdata, 3, 3))

for j in range(ndiv):
    ji, jf = j*nstep, (j+1)*nstep
    for k in range(kdiv):
        ki, kf = k*kstep, (k+1)*kstep
        xyloop2[ji:jf, ki:kf] = np.einsum('pqrs, pqsu -> pqru', x[ji:jf, ki:kf], y[ji:jf, ki:kf])

Also, I need xy (or xyloop1 or xyloop2) for my further calculations, so I have to write it out and read it back after every computation. Given the bandwidth of system I/O, do you reckon the best approach would be approach 3, since it means less I/O and also a smaller number of iterations compared to approach 2? If you have any other idea or need more info, please let me know.
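One way to make the write-as-you-go idea concrete (my sketch, not code from the post) is to stream each block's result into a memory-mapped file with `np.memmap`, so only one chunk of the output lives in RAM at a time. Toy sizes for illustration:

```python
import os
import tempfile
import numpy as np

# Toy sizes; in the real problem each chunk would be as large as RAM allows.
ndata, kdata, nstep = 8, 6, 4
x = np.random.normal(0, 1, (ndata, kdata, 3, 3))
y = np.random.normal(0, 1, (ndata, kdata, 3, 3))

path = os.path.join(tempfile.mkdtemp(), "xy.dat")
out = np.memmap(path, dtype="float64", mode="w+",
                shape=(ndata, kdata, 3, 3))
for j in range(0, ndata, nstep):
    out[j:j + nstep] = np.einsum("pqrs, pqsu -> pqru",
                                 x[j:j + nstep], y[j:j + nstep])
out.flush()  # blocks are now on disk; reopen later with mode="r"
```

Later stages can then reopen the file read-only and again work chunk by chunk, which bounds both memory use and the number of I/O round trips.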

I am new to the site, so please be gentle with me :). Any help will be highly appreciated. BTW, I am trying to solve a mixture-modelling problem on big data. Thanks!

asked Oct 19 '22 by evernomer


2 Answers

While I agree with the comment that the only way to know for sure is to profile things for yourself, there are a few guiding principles that can help you write efficient numpy code on the first try. Here are a few suggestions for your problem:

  • The overhead of creating a new numpy array is on the order of 1000 times the cost of an add/multiply, so approach 2 should be inefficient, since each call to np.dot creates an array but only performs 27 add-multiplies.
  • If you are going to have a slow for loop in Python, do it over the left-most axis if possible (for C-ordered arrays).
  • It is very hard to write very general N-dimensional code efficiently, so my guess is that a series of simpler numpy calls will be more efficient than np.einsum. Try C = np.sum(A[..., :, :, None] * B[..., None, :, :], axis=-2) (this is quite speculative, though).

So I would try something like the following:

xyloop2 = np.empty((ndata, kdata, 3, 3))

for i in range(ndata):
    xyloop2[i] = np.sum(x[i, :, :, :, None] * y[i, :, None, :, :], axis=-2)

This is similar to approach 2, but with a much simpler (and more efficient) for loop. I also swapped out the matrix multiply for something I think might be faster.
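A quick small-size sanity check (mine, not from the answer verbatim) that the broadcast-and-sum expression reproduces einsum's batched matrix product:

```python
import numpy as np

# x[..., r, s, None] * y[..., None, s, u] has shape (..., r, s, u);
# summing over axis -2 (the shared s axis) gives the matrix product.
x = np.random.normal(0, 1, (10, 7, 3, 3))
y = np.random.normal(0, 1, (10, 7, 3, 3))

tt = np.einsum("pqrs, pqsu -> pqru", x, y)
zz = np.sum(x[..., :, :, None] * y[..., None, :, :], axis=-2)
assert np.allclose(tt, zz)
```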

answered Oct 22 '22 by kiyo


Apparently, sometimes einsum is the more efficient option and sometimes the broadcast-and-sum version is. Examples with p, q, r, s being 100, 50, 3, 3:

Example I:

%timeit tt=np.einsum('pqrs, pqsu->pqru',x,y)
100 loops, best of 3: 3.45 ms per loop

%timeit zz = np.sum(x[:, :, :, :, None] * y[:, :, None, :, :], axis=-2)
10000 loops, best of 3: 153 µs per loop

Example II (with y of shape (100, 50, 3), a matrix-vector product):

%timeit zz = np.sum(x * y[:, :, None, :], axis=-1)
1000 loops, best of 3: 274 µs per loop

%timeit tt=np.einsum('pqrs, pqs->pqr',x,y)
10000 loops, best of 3: 151 µs per loop

np.allclose(zz,tt)
True
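One more knob worth noting when benchmarking (not part of the original timings): np.einsum accepts an optimize flag (NumPy >= 1.12) that can change its performance substantially, while leaving the result unchanged.

```python
import numpy as np

# optimize=True lets einsum reorder/contract via an optimized path;
# the numerical result matches the default path.
x = np.random.normal(0, 1, (100, 50, 3, 3))
y = np.random.normal(0, 1, (100, 50, 3, 3))

a = np.einsum("pqrs, pqsu -> pqru", x, y)
b = np.einsum("pqrs, pqsu -> pqru", x, y, optimize=True)
assert np.allclose(a, b)
```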

answered Oct 22 '22 by evernomer