I've tested an example demonstrated in this talk [pytables] using numpy (page 20/57).
It is stated that a[:,1].sum() takes 9.3 ms, whereas a[1,:].sum() takes only 72 us.
I tried to reproduce it, but failed to do so. Am I measuring wrongly, or have things changed in NumPy since 2010?
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[:,1].sum()'
1000 loops, best of 3: 16.5 usec per loop
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[1,:].sum()'
1000 loops, best of 3: 13.8 usec per loop
$ python2 --version
Python 2.7.7
$ python2 -c 'import numpy; print numpy.version.version'
1.8.1
While I can measure a benefit for the second version (supposedly fewer cache misses, because numpy uses C-style row ordering), I don't see the drastic difference stated by the pytables contributor.
Also, I cannot see more cache misses when using column vs. row summation.
EDIT
So far the insight for me was that I was using the timeit module in the wrong way. Repeated runs with the same array (or row/column of an array) will almost certainly be cached. I've got 32 KiB of L1 data cache, and randn gives float64 (8 bytes per element), so a single row just about fits: 4000 * 8 bytes ≈ 31 KiB < 32 KiB.
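A quick sanity check of those sizes from Python (my own side note, not part of the original post; nbytes and strides just confirm the row/column footprint):

import numpy as np

a = np.random.randn(4000, 4000)   # float64 by default
print(a[1, :].nbytes)   # 32000 bytes: one contiguous row, roughly the size of a 32 KiB L1d
print(a[:, 1].nbytes)   # also 32000 bytes, but spread over 4000 different cache lines
print(a.strides)        # (32000, 8): a column walk jumps 32 kB between consecutive elements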
Using the script from @alim's answer with a single loop (nloop=1) and ten trials (nrep=10), and varying the size of the random array (n x n), I am measuring:

n      row/us   col/us   col penalty
1k        90      100       1
4k       100      210       2
10k*     110      350       3.5
20k*     120     1200      10

* for n=10k and higher, a row/column doesn't fit into the L1d cache anymore.
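For reference, a rough sketch of how such a scan could be scripted (my own snippet, not the actual answer1.py; it creates a fresh n x n array for every trial via the timeit setup string, and the 20k case needs about 3 GB of RAM):

import timeit

for n in (1000, 4000, 10000, 20000):
    setup = "import numpy as np; a = np.random.randn(%d, %d)" % (n, n)
    row = min(timeit.repeat("a[1, :].sum()", setup, repeat=10, number=1))
    col = min(timeit.repeat("a[:, 1].sum()", setup, repeat=10, number=1))
    print("n=%5d  row: %6.1f us  col: %6.1f us  penalty: %.1f"
          % (n, row * 1e6, col * 1e6, col / row))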
I'm still not sure how to track down the cause of this, as perf shows about the same rate of cache misses (sometimes even a higher rate) for the faster row sum.
Perf data: nloop=2 and nrep=2, so I expect some of the data to still be in the cache for the second run.
n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
row sum: 103.593 us
Performance counter stats for './answer1.py':
25850670 cache-references [30.04%]
1321945 cache-misses # 5.114 % of all cache refs [20.04%]
5706371393 L1-dcache-loads [20.00%]
11733777 L1-dcache-load-misses # 0.21% of all L1-dcache hits [19.97%]
2401264190 L1-dcache-stores [20.04%]
131964213 L1-dcache-store-misses [20.03%]
2007640 L1-dcache-prefetches [20.04%]
21894150686 cycles [20.02%]
24582770606 instructions # 1.12 insns per cycle [30.06%]
3534308182 branches [30.01%]
3767 faults
6 migrations
7.331092823 seconds time elapsed
n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
column sum: 377.059 us
Performance counter stats for './answer1.py':
26673628 cache-references [30.02%]
1409989 cache-misses # 5.286 % of all cache refs [20.07%]
5676222625 L1-dcache-loads [20.06%]
11050999 L1-dcache-load-misses # 0.19% of all L1-dcache hits [19.99%]
2405281776 L1-dcache-stores [20.01%]
126425747 L1-dcache-store-misses [20.02%]
2128076 L1-dcache-prefetches [20.04%]
21876671763 cycles [20.00%]
24607897857 instructions # 1.12 insns per cycle [30.00%]
3536753654 branches [29.98%]
3763 faults
9 migrations
7.327833360 seconds time elapsed
EDIT2
I think I have understood some aspects, but the question has not been fully answered yet. At the moment I believe this summation example doesn't reveal anything about CPU caches at all. In order to eliminate uncertainty from numpy/python, I tried using perf on the summation done in C; the results are in an answer below.
I don't see anything wrong with your attempt at replication, but bear in mind that those slides are from 2010, and numpy has changed rather a lot since then. Based on the dates of numpy releases, I would guess that Francesc was probably using v1.5.
Using this script to benchmark row v column sums:
#!/usr/bin/env python
# Python 2 script: times row vs. column sums with timeit and reports the
# best (minimum) per-call time in microseconds.

import numpy as np
import timeit

print "numpy version == " + str(np.__version__)

setup = "import numpy as np; a = np.random.randn(4000, 4000)"
rsum = "a[1, :].sum()"   # sum of one row (contiguous in C order)
csum = "a[:, 1].sum()"   # sum of one column (strided access)

nloop = 1000   # calls per trial
nrep = 3       # trials; the minimum over trials is reported

print "row sum:\t%.3f us" % (
    min(timeit.repeat(rsum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
print "column sum:\t%.3f us" % (
    min(timeit.repeat(csum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
I detect about a 50% slowdown for column sums with numpy v1.5:
$ python sum_benchmark.py
numpy version == 1.5.0
row sum: 8.472 us
column sum: 12.759 us
Compared with about a 30% slowdown with v1.8.1, which you're using:
$ python sum_benchmark.py
numpy version == 1.8.1
row sum: 12.108 us
column sum: 15.768 us
It's interesting to note that both types of reduction have actually gotten a bit slower in the more recent numpy versions. I would have to delve a lot deeper into numpy's source code to understand exactly why this is the case.
We can also negate any effect of caching across calls (temporal locality) by creating a brand-new random array for every call - just set nloop to 1 and nrep to a reasonably small number (unless you really enjoy watching paint dry), say 10. I've updated my script and results above accordingly.

With nloop=1, nrep=10 on a 4000x4000 array:
numpy version == 1.5.0
row sum: 47.922 us
column sum: 103.235 us
numpy version == 1.8.1
row sum: 66.996 us
column sum: 125.885 us
That's a bit more like it, but I still can't really replicate the massive effect that Francesc's slides show. Perhaps this isn't that surprising, though - the effect may be very compiler-, architecture-, and/or kernel-dependent.
Interesting. I can reproduce Sebastian's performance:
In [21]: np.__version__
Out[21]: '1.8.1'
In [22]: a = np.random.randn(4000, 4000)
In [23]: %timeit a[:, 1].sum()
100000 loops, best of 3: 12.4 µs per loop
In [24]: %timeit a[1, :].sum()
100000 loops, best of 3: 10.6 µs per loop
However, if I try with a larger array:
In [25]: a = np.random.randn(10000, 10000)
In [26]: %timeit a[:, 1].sum()
10000 loops, best of 3: 21.8 µs per loop
In [27]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.8 µs per loop
but, if I try again:
In [28]: a = np.random.randn(10000, 10000)
In [29]: %timeit a[:, 1].sum()
10000 loops, best of 3: 64.4 µs per loop
In [30]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.9 µs per loop
So I'm not sure what's going on here, but this jitter is probably due to cache effects. Perhaps newer architectures are wiser at predicting access patterns and hence do better prefetching?
At any rate, and for comparison purposes, I am using NumPy 1.8.1, Linux Ubuntu 14.04 and a laptop with an i5-3380M CPU @ 2.90GHz.
EDIT: After thinking a bit about this, yes, I would say that the first time timeit executes the sum, the column (or the row) is fetched from RAM, but the second time the operation runs, the data is in cache (for both the row-wise and column-wise versions), so it executes fast. Since timeit takes the minimum of the runs, this is why we don't see a big difference in the times.
Another question is why we sometimes do see the difference (using timeit). But caches are weird beasts, especially in multicore machines executing multiple processes at a time.
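A minimal sketch of that warm-up effect (my own snippet, not from the original answer): time a few consecutive column sums on the same array; the first call has to pull the column from RAM, while the later calls mostly hit the caches, which is exactly what taking the minimum hides.

import numpy as np
from timeit import default_timer as timer

a = np.random.randn(10000, 10000)

# Five consecutive column sums on the same array: the first one is cold,
# the following ones largely come out of the caches.
for i in range(5):
    t0 = timer()
    a[:, 1].sum()
    print("call %d: %.1f us" % (i, (timer() - t0) * 1e6))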