I am trying to use Theano to speed up code that is already implemented in numpy and that accumulates a sum of scaled copies of an array. In numpy, the function looks like this:
import numpy as np

def numpy_fn(k0, kN, x):
    output = np.zeros_like(x)
    for k in range(k0, kN+1):
        output += k*x
    return output
with a sample call
>>> numpy_fn(1, 3, np.arange(10))
array([ 0., 6., 12., 18., 24., 30., 36., 42., 48., 54.])
The Theano equivalent of the above function is:
import theano
import theano.tensor as tt

k = tt.scalar('k')
k0 = tt.scalar('k0')
kN = tt.scalar('kN')
x = tt.vector('x')

def fn(k, sumtodate):
    return sumtodate + k*x

rslt, updt = theano.scan(fn=fn,
                         outputs_info=tt.zeros_like(x),
                         sequences=tt.arange(k0, kN+1))

theano_fn = theano.function(inputs=[k0, kN, x],
                            outputs=rslt[-1])
When called, this gives the correct output:
>>> theano_fn(1, 3, np.arange(10))
array([ 0., 6., 12., 18., 24., 30., 36., 42., 48., 54.])
However, when I benchmark the two, the numpy function is roughly three times faster than the Theano version on my computer:
%timeit theano_fn(1, 1000, np.ones(10000))
10 loops, best of 3: 21.5 ms per loop
%timeit numpy_fn(1, 1000, np.ones(10000))
100 loops, best of 3: 7.9 ms per loop
Since Theano converts the outer loop into C, should it not be faster than Python? What can be done to speed up the Theano code?
EDIT:
I am aware that the brute-force numpy code can be optimized using summation identities, but the reason I wanted to take the Theano route is that I am interested in cases where the update to output can be any general function of k and x, for example:
output += x**k
output += exp(k*x)
output += (x-k)**2
output += k*x was just one specific example to make the point. In mathematical notation, what I am trying to implement is a fast evaluation of the summation \sum_{k=k0}^{kN} f(k, x), where k0 and kN are integers, x is a vector, and f can be any general function of k and x like the ones given above.
import numpy as np

def f(k, x):
    return x**k

def numpy_fn(k0, kN, x):
    output = np.zeros_like(x)
    for k in range(k0, kN+1):
        output += f(k, x)
    return output
I was hoping that by using Theano I would be able to optimize the outer loop and get a faster solution than the brute-force numpy one.
Building on Divakar's answer...
The circumstances where Theano can outperform numpy are quite specific. In general, Theano will only perform well in comparison to numpy when the computation involves vectorisable operations on large tensors.
In this case the operation can be performed very efficiently in numpy: no loop is needed at all, because the sum of an arithmetic sequence has a standard closed form. With n = kN - k0 + 1 terms, the sum of the integers from k0 to kN is n * (k0 + kN) / 2, so
numpy.arange(k0, kN + 1).sum() == (kN - k0 + 1) * (k0 + kN) / 2
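As a quick sanity check, here is a minimal sketch (the helper name closed_form_fn is mine, not from the question) showing that the closed form reproduces the looped numpy_fn from the question:

import numpy as np

def numpy_fn(k0, kN, x):
    # brute-force loop from the question
    output = np.zeros_like(x)
    for k in range(k0, kN + 1):
        output += k * x
    return output

def closed_form_fn(k0, kN, x):
    # sum_{k=k0}^{kN} k*x == (arithmetic-series sum of k0..kN) * x
    return ((kN - k0 + 1) * (k0 + kN) // 2) * x

x = np.arange(10.0)
assert np.allclose(numpy_fn(1, 3, x), closed_form_fn(1, 3, x))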
If one needs to use Theano for some reason other than performance (e.g. to obtain gradients, or as part of some larger symbolic computation) then one can compute the same result without using sum or scan, just as in numpy.
The following code implements the original numpy and Theano approaches and compares them with Divakar's numpy approaches (plus my Theano version of the arange-sum approach) and with my numpy and Theano versions that use the standard arithmetic-series result.
import numpy
import timeit
import itertools
import theano
import theano.tensor as tt


def numpy1(k0, kN, x):
    # original brute-force loop
    output = numpy.zeros_like(x)
    for k in range(k0, kN + 1):
        output += k * x
    return output


def numpy2(k0, kN, x):
    # sum the range first, then scale x once (Divakar's approach)
    return numpy.arange(k0, kN + 1).sum() * x


def numpy3(k0, kN, x):
    # same idea, using einsum for the reduction (Divakar's approach)
    return numpy.einsum('i->', numpy.arange(k0, kN + 1)) * x


def theano1_step(k, s_tm1, x):
    return s_tm1 + k * x


def compile_theano1():
    # Theano version of the original scan-based loop
    k0 = tt.lscalar()
    kN = tt.lscalar()
    x = tt.vector()
    outputs, _ = theano.scan(theano1_step, sequences=[tt.arange(k0, kN + 1)],
                             outputs_info=[tt.zeros_like(x)],
                             non_sequences=[x], strict=True)
    return theano.function([k0, kN, x], outputs=outputs[-1])


def compile_theano2():
    # Theano version of the arange-sum approach
    k0 = tt.lscalar()
    kN = tt.lscalar()
    x = tt.vector()
    return theano.function([k0, kN, x], outputs=tt.arange(k0, kN + 1).sum() * x)


def numpy4(k0, kN, x):
    # closed-form sum of the arithmetic sequence k0..kN
    return ((kN - k0 + 1) * (k0 + kN) / 2) * x


def compile_theano4():
    # Theano version of the closed-form approach
    k0 = tt.lscalar()
    kN = tt.lscalar()
    x = tt.vector()
    return theano.function([k0, kN, x], outputs=((kN - k0 + 1) * (k0 + kN) / 2) * x)


def main():
    iteration_count = 10
    k0 = 10
    kN = 10000
    x = numpy.random.standard_normal(size=(20000,)).astype(theano.config.floatX)
    functions = [numpy1, numpy2, numpy3, numpy4, compile_theano1(), compile_theano2(), compile_theano4()]
    function_count = len(functions)
    results = numpy.empty((iteration_count * function_count, x.shape[0]), dtype=theano.config.floatX)
    times = numpy.empty((iteration_count * function_count,), dtype=theano.config.floatX)
    for iteration in xrange(iteration_count):
        for function_index, function in enumerate(functions):
            start = timeit.default_timer()
            results[iteration * function_count + function_index] = function(k0, kN, x)
            times[iteration * function_count + function_index] = timeit.default_timer() - start
    # all functions compute the same thing, so neighbouring results should match
    for result1, result2 in itertools.izip(results[0::2], results[1::2]):
        assert numpy.allclose(result1, result2)
    for function_name, function_index in itertools.izip(
            ('numpy1', 'numpy2', 'numpy3', 'numpy4', 'theano1', 'theano2', 'theano4'),
            xrange(function_count)):
        time = times[function_index::function_count].mean()
        print '%8s %.8f' % (function_name, float(time))


main()
On my crappy desktop computer, which uses a CPU (not a GPU) for the Theano computations, I get the following timings (in seconds, lower is better):
numpy1 0.27894366
numpy2 0.00011240
numpy3 0.00008502
numpy4 0.00006357
theano1 0.99175695
theano2 0.00040656
theano4 0.00017563
In this particular case, running the Theano code on a GPU is unlikely to be useful unless x is very large. But even then the cost of copying x into GPU memory could wash out any gain from the parallel element-wise multiplication.
EDIT
To address the new part in the edited version of the question...
Theano is not good with explicit loops. If you can vectorize the function f, then the computation can be performed much more efficiently (in time, though perhaps not in space) in both numpy and Theano by evaluating f for all values of k at once and summing the result along the k axis.
For example, if you want output += exp(k*x) then you can achieve this in numpy without an explicit loop like this:
k = numpy.arange(k0, kN + 1)
result = numpy.exp(numpy.outer(k, x)).sum(axis=0)
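The same vectorized idea can be expressed in Theano without scan, using broadcasting instead of an explicit loop. The following is only a sketch of one possible formulation (the names k, summed and vectorized_fn are mine); as noted above, it materialises the full (kN - k0 + 1, len(x)) intermediate, so it trades memory for speed:

import numpy
import theano
import theano.tensor as tt

k0 = tt.lscalar()
kN = tt.lscalar()
x = tt.vector()

# broadcast k against x to get a (num_k, len(x)) matrix of k*x,
# apply exp element-wise, then sum over the k axis (axis 0)
k = tt.arange(k0, kN + 1)
summed = tt.exp(k.dimshuffle(0, 'x') * x.dimshuffle('x', 0)).sum(axis=0)
vectorized_fn = theano.function([k0, kN, x], summed)

print vectorized_fn(1, 3, numpy.ones(5, dtype=theano.config.floatX))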
If f cannot be vectorized, or a loop is necessary for some other reason, then Theano may or may not offer better performance; you would have to benchmark it to find out. When an explicit loop is necessary, Theano is only likely to beat numpy if the computation inside the loop involves very large tensor operations.