I've tested an example demonstrated in this talk [pytables] using numpy (page 20/57).
It is stated that a[:,1].sum() takes 9.3 ms, whereas a[1,:].sum() takes only 72 us.
I tried to reproduce it, but failed to do so. Am I measuring wrongly, or have things changed in NumPy since 2010?
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[:,1].sum()'
1000 loops, best of 3: 16.5 usec per loop
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[1,:].sum()'
1000 loops, best of 3: 13.8 usec per loop
$ python2 --version
Python 2.7.7
$ python2 -c 'import numpy; print numpy.version.version'
1.8.1
While I can measure a benefit for the second version (supposedly fewer cache misses, because numpy uses C-style row ordering), I don't see the drastic difference stated by the pytables contributor.
Also, I cannot see more cache misses when using column vs. row summation.
EDIT
So far the insight for me was that I was using the timeit module in the wrong way. Repeated runs with the same array (or row/column of an array) will almost certainly be cached. I've got 32 KiB of L1 data cache, and randn gives float64 (8 bytes per element), so a single row just about fits: 4000 * 8 bytes ≈ 31 KiB < 32 KiB.
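A quick sanity check of those sizes from Python (my own side note, not part of the original post; nbytes and strides just confirm the row/column footprint):

import numpy as np

a = np.random.randn(4000, 4000)   # float64 by default
print(a[1, :].nbytes)   # 32000 bytes: one contiguous row, roughly the size of a 32 KiB L1d
print(a[:, 1].nbytes)   # also 32000 bytes, but spread over 4000 different cache lines
print(a.strides)        # (32000, 8): a column walk jumps 32 kB between consecutive elements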
Using the script from @alim's answer with a single loop (nloop=1) and ten trials (nrep=10), and varying the size of the random array (n x n), I am measuring:

n      row/us   col/us   col penalty
1k        90      100       1
4k       100      210       2
10k*     110      350       3.5
20k*     120     1200      10

* for n=10k and higher, a row/column doesn't fit into the L1d cache anymore.
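For reference, a rough sketch of how such a scan could be scripted (my own snippet, not the actual answer1.py; it creates a fresh n x n array for every trial via the timeit setup string, and the 20k case needs about 3 GB of RAM):

import timeit

for n in (1000, 4000, 10000, 20000):
    setup = "import numpy as np; a = np.random.randn(%d, %d)" % (n, n)
    row = min(timeit.repeat("a[1, :].sum()", setup, repeat=10, number=1))
    col = min(timeit.repeat("a[:, 1].sum()", setup, repeat=10, number=1))
    print("n=%5d  row: %6.1f us  col: %6.1f us  penalty: %.1f"
          % (n, row * 1e6, col * 1e6, col / row))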
I'm still not sure how to track down the cause of this, as perf shows about the same rate of cache misses (sometimes even a higher rate) for the faster row sum.
Perf data: nloop=2 and nrep=2, so I expect some of the data to still be in the cache for the second run.
n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
row sum: 103.593 us
Performance counter stats for './answer1.py':
25850670 cache-references [30.04%]
1321945 cache-misses # 5.114 % of all cache refs [20.04%]
5706371393 L1-dcache-loads [20.00%]
11733777 L1-dcache-load-misses # 0.21% of all L1-dcache hits [19.97%]
2401264190 L1-dcache-stores [20.04%]
131964213 L1-dcache-store-misses [20.03%]
2007640 L1-dcache-prefetches [20.04%]
21894150686 cycles [20.02%]
24582770606 instructions # 1.12 insns per cycle [30.06%]
3534308182 branches [30.01%]
3767 faults
6 migrations
7.331092823 seconds time elapsed
n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
column sum: 377.059 us
Performance counter stats for './answer1.py':
26673628 cache-references [30.02%]
1409989 cache-misses # 5.286 % of all cache refs [20.07%]
5676222625 L1-dcache-loads [20.06%]
11050999 L1-dcache-load-misses # 0.19% of all L1-dcache hits [19.99%]
2405281776 L1-dcache-stores [20.01%]
126425747 L1-dcache-store-misses [20.02%]
2128076 L1-dcache-prefetches [20.04%]
21876671763 cycles [20.00%]
24607897857 instructions # 1.12 insns per cycle [30.00%]
3536753654 branches [29.98%]
3763 faults
9 migrations
7.327833360 seconds time elapsed
EDIT2
I think I have understood some aspects, but the question has not been fully answered yet. At the moment I believe this summation example doesn't reveal anything about CPU caches at all. In order to eliminate uncertainty from numpy/python, I tried using perf on the summation done in C; the results are in an answer below.
I don't see anything wrong with your attempt at replication, but bear in mind that those slides are from 2010, and numpy has changed rather a lot since then. Based on the dates of numpy releases, I would guess that Francesc was probably using v1.5.
Using this script to benchmark row v column sums:
#!/usr/bin/env python
# Python 2 script: times row vs. column sums with timeit and reports the
# best (minimum) per-call time in microseconds.

import numpy as np
import timeit

print "numpy version == " + str(np.__version__)

setup = "import numpy as np; a = np.random.randn(4000, 4000)"
rsum = "a[1, :].sum()"   # sum of one row (contiguous in C order)
csum = "a[:, 1].sum()"   # sum of one column (strided access)

nloop = 1000   # calls per trial
nrep = 3       # trials; the minimum over trials is reported

print "row sum:\t%.3f us" % (
    min(timeit.repeat(rsum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
print "column sum:\t%.3f us" % (
    min(timeit.repeat(csum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
I detect about a 50% slowdown for column sums with numpy v1.5:
$ python sum_benchmark.py
numpy version == 1.5.0
row sum: 8.472 us
column sum: 12.759 us
Compared with about a 30% slowdown with v1.8.1, which you're using:
$ python sum_benchmark.py
numpy version == 1.8.1
row sum: 12.108 us
column sum: 15.768 us
It's interesting to note that both types of reduction have actually gotten a bit slower in the more recent numpy versions. I would have to delve a lot deeper into numpy's source code to understand exactly why this is the case.
We can also negate any effect of caching across calls (temporal locality) by creating a brand-new random array for every call - just set nloop to 1 and nrep to a reasonably small number (unless you really enjoy watching paint dry), say 10. I've updated my script and results above accordingly.

With nloop=1, nrep=10 on a 4000x4000 array:
numpy version == 1.5.0
row sum: 47.922 us
column sum: 103.235 us
numpy version == 1.8.1
row sum: 66.996 us
column sum: 125.885 us
That's a bit more like it, but I still can't really replicate the massive effect that Francesc's slides show. Perhaps this isn't that surprising, though - the effect may be very compiler-, architecture-, and/or kernel-dependent.
Interesting. I can reproduce Sebastian's performance:
In [21]: np.__version__
Out[21]: '1.8.1'
In [22]: a = np.random.randn(4000, 4000)
In [23]: %timeit a[:, 1].sum()
100000 loops, best of 3: 12.4 µs per loop
In [24]: %timeit a[1, :].sum()
100000 loops, best of 3: 10.6 µs per loop
However, if I try with a larger array:
In [25]: a = np.random.randn(10000, 10000)
In [26]: %timeit a[:, 1].sum()
10000 loops, best of 3: 21.8 µs per loop
In [27]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.8 µs per loop
but, if I try again:
In [28]: a = np.random.randn(10000, 10000)
In [29]: %timeit a[:, 1].sum()
10000 loops, best of 3: 64.4 µs per loop
In [30]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.9 µs per loop
So I'm not sure what's going on here, but this jitter is probably due to cache effects. Perhaps newer architectures are wiser at predicting access patterns and hence do better prefetching?
At any rate, and for comparison purposes, I am using NumPy 1.8.1, Linux Ubuntu 14.04 and a laptop with an i5-3380M CPU @ 2.90GHz.
EDIT: After thinking a bit about this, yes, I would say that the first time timeit executes the sum, the column (or the row) is fetched from RAM, but the second time the operation runs, the data is in cache (for both the row-wise and column-wise versions), so it executes fast. Since timeit takes the minimum of the runs, this is why we don't see a big difference in the times.
Another question is why we sometimes do see the difference (using timeit). But caches are weird beasts, especially in multicore machines executing multiple processes at a time.
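A minimal sketch of that warm-up effect (my own snippet, not from the original answer): time a few consecutive column sums on the same array; the first call has to pull the column from RAM, while the later calls mostly hit the caches, which is exactly what taking the minimum hides.

import numpy as np
from timeit import default_timer as timer

a = np.random.randn(10000, 10000)

# Five consecutive column sums on the same array: the first one is cold,
# the following ones largely come out of the caches.
for i in range(5):
    t0 = timer()
    a[:, 1].sum()
    print("call %d: %.1f us" % (i, (timer() - t0) * 1e6))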