I've usually gotten good performance out of numpy's einsum function (and I like it's syntax). @Ophion's answer to this question shows that - for the cases tested - einsum consistently outperforms the "built-in" functions (sometimes by a little, sometimes by a lot). But I just encountered a case where einsum is much slower. Consider the following equivalent functions: <pre class="prettyprint"><code>(M, K) = (1000000, 20) C = np.random.rand(K, K) X = np.random.rand(M, K) def func_dot(C, X): Y = X.dot(C) return np.sum(Y * X, axis=1) def func_einsum(C, X): return np.einsum('ik,km,im->i', X, C, X) def func_einsum2(C, X): # Like func_einsum but break it into two steps. A = np.einsum('ik,km', X, C) return np.einsum('ik,ik->i', A, X) </code></pre> I expected <code>func_einsum</code> to run fastest but that is not what I encounter. Running on a quad-core cpu with hyperthreading, numpy version 1.9.0.dev-7ae0206, and multithreading with OpenBLAS, I get the following results: <pre class="prettyprint"><code>In [2]: %time y1 = func_dot(C, X) CPU times: user 320 ms, sys: 312 ms, total: 632 ms Wall time: 209 ms In [3]: %time y2 = func_einsum(C, X) CPU times: user 844 ms, sys: 0 ns, total: 844 ms Wall time: 842 ms In [4]: %time y3 = func_einsum2(C, X) CPU times: user 292 ms, sys: 44 ms, total: 336 ms Wall time: 334 ms </code></pre> When I increase <code>K</code> to 200, the differences are more extreme: <pre class="prettyprint"><code>In [2]: %time y1= func_dot(C, X) CPU times: user 4.5 s, sys: 1.02 s, total: 5.52 s Wall time: 2.3 s In [3]: %time y2= func_einsum(C, X) CPU times: user 1min 16s, sys: 44 ms, total: 1min 16s Wall time: 1min 16s In [4]: %time y3 = func_einsum2(C, X) CPU times: user 15.3 s, sys: 312 ms, total: 15.6 s Wall time: 15.6 s </code></pre> Can someone explain why einsum is so much slower here? If it matters, here is my numpy config: <pre class="prettyprint"><code>In [6]: np.show_config() lapack_info: libraries = ['openblas'] library_dirs = ['/usr/local/lib'] language = f77 atlas_threads_info: libraries = ['openblas'] library_dirs = ['/usr/local/lib'] define_macros = [('ATLAS_WITHOUT_LAPACK', None)] language = c include_dirs = ['/usr/local/include'] blas_opt_info: libraries = ['openblas'] library_dirs = ['/usr/local/lib'] define_macros = [('ATLAS_INFO', '"\\"None\\""')] language = c include_dirs = ['/usr/local/include'] atlas_blas_threads_info: libraries = ['openblas'] library_dirs = ['/usr/local/lib'] define_macros = [('ATLAS_INFO', '"\\"None\\""')] language = c include_dirs = ['/usr/local/include'] lapack_opt_info: libraries = ['openblas', 'openblas'] library_dirs = ['/usr/local/lib'] define_macros = [('ATLAS_WITHOUT_LAPACK', None)] language = f77 include_dirs = ['/usr/local/include'] lapack_mkl_info: NOT AVAILABLE blas_mkl_info: NOT AVAILABLE mkl_info: NOT AVAILABLE </code></pre>

You can have the best of both worlds: <pre class="prettyprint"><code>def func_dot_einsum(C, X): Y = X.dot(C) return np.einsum('ij,ij->i', Y, X) </code></pre> On my system: <pre class="prettyprint"><code>In [7]: %timeit func_dot(C, X) 10 loops, best of 3: 31.1 ms per loop In [8]: %timeit func_einsum(C, X) 10 loops, best of 3: 105 ms per loop In [9]: %timeit func_einsum2(C, X) 10 loops, best of 3: 43.5 ms per loop In [10]: %timeit func_dot_einsum(C, X) 10 loops, best of 3: 21 ms per loop </code></pre> When available, <code>np.dot</code> uses BLAS, MKL, or whatever library you have . So the call to <code>np.dot</code> is almost certainly being multithreaded. <code>np.einsum</code> has its own loops, so doesn't use any of those optimizations, apart from its own use of SIMD to speed things up over a vanilla C implementation. <hr> Then there's the multi-input einsum call that runs much slower... The numpy source for einsum is very complex and I don't fully understand it. So be advised that the following is speculative at best, but here's what I think is going on... When you run something like <code>np.einsum('ij,ij->i', a, b)</code>, the benefit over doing <code>np.sum(a*b, axis=1)</code> comes from avoiding having to instantiate the intermediate array with all the products, and looping twice over it. So at the low level what goes on is something like: <pre class="prettyprint"><code>for i in range(I): out[i] = 0 for j in range(J): out[i] += a[i, j] * b[i, j] </code></pre> Say now that you are after something like: <pre class="prettyprint"><code>np.einsum('ij,jk,ik->i', a, b, c) </code></pre> You could do the same operation as <pre class="prettyprint"><code>np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2)) </code></pre> And what I think einsum does is to run this last code without having to instantiate the huge intermediate array, which certainly makes things faster: <pre class="prettyprint"><code>In [29]: a, b, c = np.random.rand(3, 100, 100) In [30]: %timeit np.einsum('ij,jk,ik->i', a, b, c) 100 loops, best of 3: 2.41 ms per loop In [31]: %timeit np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2)) 100 loops, best of 3: 12.3 ms per loop </code></pre> But if you look at it carefully, getting rid of intermediate storage can be a terrible thing. This is what I think einsum is doing at the low level: <pre class="prettyprint"><code>for i in range(I): out[i] = 0 for j in range(J): for k in range(K): out[i] += a[i, j] * b[j, k] * c[i, k] </code></pre> But you are repeating a ton of operations! If you instead did: <pre class="prettyprint"><code>for i in range(I): out[i] = 0 for j in range(J): temp = 0 for k in range(K): temp += b[j, k] * c[i, k] out[i] += a[i, j] * temp </code></pre> you would be doing <code>I * J * (K-1)</code> less multiplications (and <code>I * J</code> extra additions), and save yourself a ton of time. My guess is that einsum is not smart enough to optimize things at this level. In the source code there is a hint that it only optimizes operations with 1 or 2 operands, not 3. In any case automating this for general inputs seems like anything but simple...

Why is numpy's einsum slower than numpy's built-in functions?

Tags:

performance

python

numpy

linear-algebra

I've usually gotten good performance out of numpy's einsum function (and I like it's syntax). @Ophion's answer to this question shows that - for the cases tested - einsum consistently outperforms the "built-in" functions (sometimes by a little, sometimes by a lot). But I just encountered a case where einsum is much slower. Consider the following equivalent functions:

(M, K) = (1000000, 20)
C = np.random.rand(K, K)
X = np.random.rand(M, K)

def func_dot(C, X):
    Y = X.dot(C)
    return np.sum(Y * X, axis=1)

def func_einsum(C, X):
    return np.einsum('ik,km,im->i', X, C, X)

def func_einsum2(C, X):
    # Like func_einsum but break it into two steps.
    A = np.einsum('ik,km', X, C)
    return np.einsum('ik,ik->i', A, X)

I expected func_einsum to run fastest but that is not what I encounter. Running on a quad-core cpu with hyperthreading, numpy version 1.9.0.dev-7ae0206, and multithreading with OpenBLAS, I get the following results:

In [2]: %time y1 = func_dot(C, X)
CPU times: user 320 ms, sys: 312 ms, total: 632 ms
Wall time: 209 ms
In [3]: %time y2 = func_einsum(C, X)
CPU times: user 844 ms, sys: 0 ns, total: 844 ms
Wall time: 842 ms
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 292 ms, sys: 44 ms, total: 336 ms
Wall time: 334 ms

When I increase K to 200, the differences are more extreme:

In [2]: %time y1= func_dot(C, X)
CPU times: user 4.5 s, sys: 1.02 s, total: 5.52 s
Wall time: 2.3 s
In [3]: %time y2= func_einsum(C, X)
CPU times: user 1min 16s, sys: 44 ms, total: 1min 16s
Wall time: 1min 16s
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 15.3 s, sys: 312 ms, total: 15.6 s
Wall time: 15.6 s

Can someone explain why einsum is so much slower here?

If it matters, here is my numpy config:

In [6]: np.show_config()
lapack_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
atlas_threads_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
    language = c
    include_dirs = ['/usr/local/include']
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/local/include']
atlas_blas_threads_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/local/include']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
    language = f77
    include_dirs = ['/usr/local/include']
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

389

asked Nov 22 '13 16:11

bogatron

Video Answer

1 Answers

You can have the best of both worlds:

def func_dot_einsum(C, X):
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)

On my system:

In [7]: %timeit func_dot(C, X)
10 loops, best of 3: 31.1 ms per loop

In [8]: %timeit func_einsum(C, X)
10 loops, best of 3: 105 ms per loop

In [9]: %timeit func_einsum2(C, X)
10 loops, best of 3: 43.5 ms per loop

In [10]: %timeit func_dot_einsum(C, X)
10 loops, best of 3: 21 ms per loop

When available, np.dot uses BLAS, MKL, or whatever library you have . So the call to np.dot is almost certainly being multithreaded. np.einsum has its own loops, so doesn't use any of those optimizations, apart from its own use of SIMD to speed things up over a vanilla C implementation.

Then there's the multi-input einsum call that runs much slower... The numpy source for einsum is very complex and I don't fully understand it. So be advised that the following is speculative at best, but here's what I think is going on...

When you run something like np.einsum('ij,ij->i', a, b), the benefit over doing np.sum(a*b, axis=1) comes from avoiding having to instantiate the intermediate array with all the products, and looping twice over it. So at the low level what goes on is something like:

for i in range(I):
    out[i] = 0
    for j in range(J):
        out[i] += a[i, j] * b[i, j]

Say now that you are after something like:

np.einsum('ij,jk,ik->i', a, b, c)

You could do the same operation as

np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))

And what I think einsum does is to run this last code without having to instantiate the huge intermediate array, which certainly makes things faster:

In [29]: a, b, c = np.random.rand(3, 100, 100)

In [30]: %timeit np.einsum('ij,jk,ik->i', a, b, c)
100 loops, best of 3: 2.41 ms per loop

In [31]: %timeit np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
100 loops, best of 3: 12.3 ms per loop

But if you look at it carefully, getting rid of intermediate storage can be a terrible thing. This is what I think einsum is doing at the low level:

for i in range(I):
    out[i] = 0
    for j in range(J):
        for k in range(K):
            out[i] += a[i, j] * b[j, k] * c[i, k]

But you are repeating a ton of operations! If you instead did:

for i in range(I):
    out[i] = 0
    for j in range(J):
        temp = 0
        for k in range(K):
            temp += b[j, k] * c[i, k]
        out[i] += a[i, j] * temp

you would be doing I * J * (K-1) less multiplications (and I * J extra additions), and save yourself a ton of time. My guess is that einsum is not smart enough to optimize things at this level. In the source code there is a hint that it only optimizes operations with 1 or 2 operands, not 3. In any case automating this for general inputs seems like anything but simple...

188

answered Oct 09 '22 16:10

Jaime

Related questions
                            
                                What are the pythonic way to replace a specific set element?
                            
                                Composite Keys in Sqlalchemy
                            
                                Installing mod_wsgi on WAMP server running on Windows 7
                            
                                Divide DataFrame by first row
                            
                                python mysql delete statement not working
                            
                                Python index error value not in list...on .index(value)
                            
                                Mark ticks in latex in matplotlib
                            
                                Unpack binary data with python
                            
                                Why do I get "UserWarning: Module dap was already imported from None ..."
                            
                                Scrapy : storing the data
                            
                                Specifying object property in dot notation with a variable
                            
                                pil draw text with different colors
                            
                                Django Serving a Download File
                            
                                Getting <generator object <genexpr>
                            
                                string.lower in Python 3
                            
                                Python Pandas - Removing Rows From A DataFrame Based on a Previously Obtained Subset
                            
                                Python vs Iron Python [closed]
                            
                                perform varimax rotation in python using numpy
                            
                                Where shall I put my self-written Python packages?
                            
                                Convert generator object to a dictionary

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With