Do we have a GPU-accelerated version of numpy.max(X, axis=None) in Theano?
I looked into the documentation and found theano.tensor.max(X, axis=None), but it is 4-5 times slower than the numpy implementation.
I can assure you it is not slow because of a bad choice of matrix size: the same matrix under theano.tensor.exp is 40 times faster than its numpy counterpart.
Any suggestions?
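For reference, the timing comparison I describe looks roughly like this sketch (the matrix size and repetition count here are arbitrary placeholders, not the exact ones I used):

import timeit
import numpy as np
import theano
import theano.tensor as T

# Run with THEANO_FLAGS=device=gpu,floatX=float32; sizes are placeholders.
data = np.random.rand(4000, 4000).astype('float32')

X = T.matrix('X')
theano_max = theano.function([X], T.max(X))   # axis=None reduction
theano_exp = theano.function([X], T.exp(X))   # elementwise op, for comparison

print('numpy  max:', timeit.timeit(lambda: np.max(data), number=20))
print('theano max:', timeit.timeit(lambda: theano_max(data), number=20))
print('theano exp:', timeit.timeit(lambda: theano_exp(data), number=20))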
The previous answer is only partially correct. The suggested workaround should not help, because that workaround is exactly what ends up in the final compiled code anyway: there is an optimization that performs this transformation automatically.
The title of the question isn't the same as its content; they differ in the axis argument. I'll answer both questions.
If the axis is 0 or None, we support this operation on the GPU for matrices. If the axis is None, we have a basic implementation that isn't well optimized, as it is harder to parallelize. If the axis is 0, we have a basic implementation, but it is faster because it is easier to parallelize.
Also, how did you do your timing? If you just make one function with only that operation and test it via the device=gpu flag for your comparison, this will include the transfer time between CPU and GPU. This is a memory-bound operation, so if you include the transfer in your timing, personally I don't expect any speed-up in that case. To see only the GPU operation, use the Theano profiler: run with the Theano flag profile=True.
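To illustrate what I mean, here is a minimal sketch (variable names and matrix size are my own, not from the question) that keeps the data in a shared variable so the CPU-GPU transfer is excluded, and reads the per-op timings from the profiler; if I recall the API correctly, profile=True on theano.function exposes a profile object with a summary() method:

import numpy as np
import theano
import theano.tensor as T

# Keep the data in a shared variable so it already lives on the GPU and the
# CPU<->GPU transfer is not part of what gets measured.
data = np.random.rand(4000, 4000).astype(theano.config.floatX)
X = theano.shared(data, name='X')

f_axis0 = theano.function([], T.max(X, axis=0), profile=True)  # easier to parallelize
f_none = theano.function([], T.max(X), profile=True)           # axis=None, basic implementation

f_axis0()
f_none()
f_axis0.profile.summary()   # per-op timings for the axis=0 reduction
f_none.profile.summary()    # per-op timings for the full reduction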
The max and exp operations are fundamentally different; exp (and other operations like addition, sin, etc.) is an elementwise operation that is embarrassingly parallelizable, while max requires a parallel-processing scan algorithm that basically builds up a tree of pairwise comparisons over an array. It's not impossible to speed up max, but it's not as easy as exp.
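To make the difference concrete, here is a small NumPy sketch (not Theano's actual GPU kernel) of the pairwise tree reduction that a parallel max has to perform; each pass could run its comparisons in parallel on a GPU, but the passes themselves are sequential, unlike a single elementwise pass for exp:

import numpy as np

def tree_max(values):
    # Pairwise (tree) reduction for max: each pass halves the array by
    # comparing neighbouring pairs, so n elements need about log2(n)
    # dependent passes.
    v = np.asarray(values, dtype=float)
    while v.size > 1:
        if v.size % 2:                      # carry the odd element forward
            v = np.append(v, v[-1])
        v = np.maximum(v[0::2], v[1::2])    # one comparison pass
    return v[0]

print(tree_max(np.random.rand(1000)))       # matches np.max on the same data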
Anyway, the theano implementation of max basically consists of the following lines (in theano/tensor/basic.py):
try:
    out = max_and_argmax(x, axis)[0]
except Exception:
    out = CAReduce(scal.maximum, axis)(x)
where max_and_argmax is a bunch of custom code that, to my eye, implements a max+argmax operation using numpy, and CAReduce is a generic GPU-accelerated scan operation used as a fallback (which, according to the comments, doesn't support grad etc.). You could try using the fallback directly and see whether that is faster, maybe something like this:
from theano.tensor.elemwise import CAReduce
from theano.scalar import maximum

def mymax(X, axis=None):
    # Use the generic CAReduce fallback directly instead of max_and_argmax.
    return CAReduce(maximum, axis)(X)
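If you try it, a quick sanity check along these lines should confirm it agrees with numpy (the names here are just for illustration):

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
f = theano.function([X], mymax(X))   # full (axis=None) reduction via CAReduce

data = np.random.rand(512, 512).astype(theano.config.floatX)
assert np.allclose(f(data), data.max())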