
Difference on performance between numpy and matlab

I am computing the backpropagation algorithm for a sparse autoencoder. I have implemented it in Python using NumPy and in MATLAB. The code is almost the same, but the performance is very different: MATLAB completes the task in 0.252454 seconds while NumPy takes 0.973672151566 seconds, almost four times as long. I will call this code several times later in a minimization problem, so this difference leads to several minutes of delay between the implementations. Is this normal behaviour? How could I improve the performance in NumPy?

NumPy implementation:

sparse.rho is a tuning parameter; sparse.nodes is the number of nodes in the hidden layer (25); sparse.input (64) is the number of nodes in the input layer; theta1 and theta2 are the weight matrices for the first and second layers, with dimensions 25x64 and 64x25 respectively; m equals 10000; rhoest has shape (25,); x has shape 10000x64, a3 10000x64, and a2 10000x25.

UPDATE: I have introduced changes in the code following some of the ideas in the responses. The performance is now NumPy: 0.65 s vs MATLAB: 0.25 s.

import time

import numpy as np

partial_j1 = np.zeros(sparse.theta1.shape)
partial_j2 = np.zeros(sparse.theta2.shape)
partial_b1 = np.zeros(sparse.b1.shape)
partial_b2 = np.zeros(sparse.b2.shape)
t = time.time()

delta3t = (-(x - a3) * a3 * (1 - a3)).T

# sum2 does not depend on i, so it is computed once outside the loop
# (same formula as in the MATLAB code below); reshaped to (25, 1) for
# broadcasting against the (25, 1) sum1 inside the loop
sum2 = sparse.beta * (-float(sparse.rho) / rhoest
                      + float(1.0 - sparse.rho) / (1.0 - rhoest))
sum2 = sum2[:, np.newaxis]

for i in range(m):
    delta3 = delta3t[:, i:(i + 1)]
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
    partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
    partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
    partial_b1 += delta2
    partial_b2 += delta3

print "Backprop time:", time.time() - t

MATLAB implementation:

tic
for i = 1:m
    delta3 = -(data(i,:)-a3(i,:)).*a3(i,:).*(1 - a3(i,:));
    delta3 = delta3.';
    sum1 = W2.'*delta3;
    sum2 = beta*(-sparsityParam./rhoest + (1 - sparsityParam) ./ (1.0 - rhoest));
    delta2 = ( sum1 + sum2 ) .* a2(i,:).' .* (1 - a2(i,:).');
    W1grad = W1grad + delta2* a1(i,:);
    W2grad = W2grad + delta3* a2(i,:);
    b1grad = b1grad + delta2;
    b2grad = b2grad + delta3;
end
toc
asked Aug 29 '13 by pabaldonedo




1 Answer

It would be wrong to say "MATLAB is always faster than NumPy" or vice versa; often their performance is comparable. To get good performance with NumPy, keep in mind that NumPy's speed comes from calling underlying functions written in C/C++/Fortran. It performs well when you apply those functions to whole arrays. In general, you get poorer performance when you call those NumPy functions on small arrays or scalars in a Python loop.

What's wrong with a Python loop, you ask? Every iteration through a Python loop is a call to a next method. Every use of [] indexing is a call to a __getitem__ method. Every += is a call to __iadd__. Every dotted attribute lookup (such as in np.dot) involves function calls. Those function calls add up to a significant hindrance to speed. These hooks give Python expressive power -- indexing a string means something different from indexing a dict, for example. Same syntax, different meanings. The magic is accomplished by giving the objects different __getitem__ methods.
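
To see those hooks in action, here is a minimal sketch (the Tracer class is purely illustrative, not part of NumPy or of the code above) showing that [] is just sugar for a __getitem__ call:

class Tracer(object):
    """Toy class: [] indexing dispatches to __getitem__."""
    def __getitem__(self, idx):
        print("__getitem__ called with %r" % (idx,))
        return idx

t = Tracer()
t[3]      # prints: __getitem__ called with 3
t[1:5]    # prints: __getitem__ called with slice(1, 5, None)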

But that expressive power comes at a cost in speed. So when you don't need all that dynamic expressivity, to get better performance, try to limit yourself to NumPy function calls on whole arrays.
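
As a tiny illustration of the gap (a sketch with a made-up array; exact timings will vary by machine), compare summing element by element against one vectorized call:

import numpy as np

x = np.random.random(10 ** 5)

# Python loop: ~10**5 trips through the interpreter, each one a
# next/__iadd__ round-trip.
total = 0.0
for value in x:
    total += value

# Vectorized: a single call into NumPy's compiled code.
assert np.allclose(total, x.sum())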

So, remove the for-loop; use "vectorized" equations when possible. For example, instead of

for i in range(m):
    delta3 = -(x[i,:] - a3[i,:]) * a3[i,:] * (1 - a3[i,:])

you can compute delta3 for each i all at once:

delta3 = -(x-a3)*a3*(1-a3) 

Whereas delta3 is a vector inside the for-loop, with the vectorized equation delta3 is a matrix.


Some of the computations in the for-loop do not depend on i and therefore should be lifted outside the loop. For example, sum2 looks like a constant:

sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest) ) 

Here is a runnable example with an alternative implementation (alt) of your code (orig).

My timeit benchmark shows a 6.8x improvement in speed:

In [52]: %timeit orig()
1 loops, best of 3: 495 ms per loop

In [53]: %timeit alt()
10 loops, best of 3: 72.6 ms per loop

import numpy as np


class Bunch(object):
    """ http://code.activestate.com/recipes/52308 """
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

m, n, p = 10 ** 4, 64, 25

sparse = Bunch(
    theta1=np.random.random((p, n)),
    theta2=np.random.random((n, p)),
    b1=np.random.random((p, 1)),
    b2=np.random.random((n, 1)),
)

x = np.random.random((m, n))
a3 = np.random.random((m, n))
a2 = np.random.random((m, p))
a1 = np.random.random((m, n))
sum2 = np.random.random((p, ))
sum2 = sum2[:, np.newaxis]

def orig():
    partial_j1 = np.zeros(sparse.theta1.shape)
    partial_j2 = np.zeros(sparse.theta2.shape)
    partial_b1 = np.zeros(sparse.b1.shape)
    partial_b2 = np.zeros(sparse.b2.shape)
    delta3t = (-(x - a3) * a3 * (1 - a3)).T
    for i in range(m):
        delta3 = delta3t[:, i:(i + 1)]
        sum1 = np.dot(sparse.theta2.T, delta3)
        delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
        partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
        partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
        partial_b1 += delta2
        partial_b2 += delta3
        # delta3: (64, 1)
        # sum1: (25, 1)
        # delta2: (25, 1)
        # a1[i:(i+1),:]: (1, 64)
        # partial_j1: (25, 64)
        # partial_j2: (64, 25)
        # partial_b1: (25, 1)
        # partial_b2: (64, 1)
        # a2[i:(i+1),:]: (1, 25)
    return partial_j1, partial_j2, partial_b1, partial_b2


def alt():
    delta3 = (-(x - a3) * a3 * (1 - a3)).T
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2.T * (1 - a2.T)
    # delta3: (64, 10000)
    # sum1: (25, 10000)
    # delta2: (25, 10000)
    # a1: (10000, 64)
    # a2: (10000, 25)
    partial_j1 = np.dot(delta2, a1)
    partial_j2 = np.dot(delta3, a2)
    partial_b1 = delta2.sum(axis=1)
    partial_b2 = delta3.sum(axis=1)
    return partial_j1, partial_j2, partial_b1, partial_b2

answer = orig()
result = alt()
for a, r in zip(answer, result):
    try:
        assert np.allclose(np.squeeze(a), r)
    except AssertionError:
        print(a.shape)
        print(r.shape)
        raise

Tip: Notice that I left the shapes of all the intermediate arrays in the comments. Knowing the shapes of the arrays helped me understand what your code was doing. The shapes of the arrays can help guide you toward the right NumPy functions to use. Or at least, paying attention to the shapes can help you know whether an operation is sensible. For example, when you compute

np.dot(A, B) 

and A.shape = (n, m) and B.shape = (m, p), then np.dot(A, B) will be an array of shape (n, p).
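
A quick sanity check of that rule with small, made-up shapes:

import numpy as np

A = np.ones((3, 4))       # (n, m)
B = np.ones((4, 2))       # (m, p)
C = np.dot(A, B)
print(C.shape)            # (3, 2): the shared dimension m contracts away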


It can help to build your arrays in C_CONTIGUOUS order (at least, when using np.dot). There might be as much as a 3x speed-up by doing so:

Below, x is the same as xf except that x is C_CONTIGUOUS and xf is F_CONTIGUOUS -- and the same relationship for y and yf.

import numpy as np

m, n, p = 10 ** 4, 64, 25
x = np.random.random((n, m))
xf = np.asarray(x, order='F')

y = np.random.random((m, n))
yf = np.asarray(y, order='F')

assert np.allclose(x, xf)
assert np.allclose(y, yf)
assert np.allclose(np.dot(x, y), np.dot(xf, y))
assert np.allclose(np.dot(x, y), np.dot(xf, yf))

%timeit benchmarks show the difference in speed:

In [50]: %timeit np.dot(x, y)
100 loops, best of 3: 12.9 ms per loop

In [51]: %timeit np.dot(xf, y)
10 loops, best of 3: 27.7 ms per loop

In [56]: %timeit np.dot(x, yf)
10 loops, best of 3: 21.8 ms per loop

In [53]: %timeit np.dot(xf, yf)
10 loops, best of 3: 33.3 ms per loop
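
If you are unsure of an array's memory layout, its flags attribute will tell you, and np.ascontiguousarray makes a C-ordered copy only when one is needed. A small sketch:

import numpy as np

x = np.random.random((64, 10 ** 4))    # C-ordered by default
xf = np.asarray(x, order='F')          # Fortran-ordered copy

print(x.flags['C_CONTIGUOUS'])         # True
print(xf.flags['C_CONTIGUOUS'])        # False

xc = np.ascontiguousarray(xf)          # back to C order
print(xc.flags['C_CONTIGUOUS'])        # True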

Regarding benchmarking in Python:

It can be misleading to benchmark Python code using the difference between pairs of time.time() calls. You need to repeat the measurement many times, it's better to disable the automatic garbage collector, and it is important to measure large spans of time (such as at least 10 seconds' worth of repetitions) to avoid errors due to poor clock-timer resolution and to reduce the significance of the time.time call overhead. Instead of writing all that code yourself, Python provides the timeit module. I'm essentially using it to time the pieces of code, except that I'm calling it through an IPython terminal for convenience.
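
For reference, a plain-script equivalent of the %timeit calls above might look like this (same shapes as in the example; the statement to time is up to you):

import timeit

setup = """
import numpy as np
m, n = 10 ** 4, 64
x = np.random.random((n, m))
y = np.random.random((m, n))
"""

# repeat() runs the statement `number` times per trial and returns one
# total per trial; take the best trial and divide for a per-call time.
trials = timeit.repeat("np.dot(x, y)", setup=setup, repeat=3, number=100)
print(min(trials) / 100)   # seconds per call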

I'm not sure if this is affecting your benchmarks, but be aware that it could make a difference. In the question I linked to, two pieces of code differed by a factor of 1.7x according to time.time, while benchmarks using timeit showed the pieces of code running in essentially identical amounts of time.

answered Sep 23 '22 by unutbu