
Column-wise sum vs. row-wise sum: Why don't I see a difference using NumPy?

I've tested an example demonstrated in this talk [pytables] (page 20/57) using NumPy.

It states that a[:,1].sum() takes 9.3 ms, whereas a[1,:].sum() takes only 72 us.

I tried to reproduce it, but failed to do so. Am I measuring incorrectly? Or have things changed in NumPy since 2010?

$ python2 -m timeit -n1000 --setup \ 
  'import numpy as np; a = np.random.randn(4000,4000);' 'a[:,1].sum()' 
1000 loops, best of 3: 16.5 usec per loop

$ python2 -m timeit -n1000 --setup \ 
  'import numpy as np; a = np.random.randn(4000,4000);' 'a[1,:].sum()' 
1000 loops, best of 3: 13.8 usec per loop

$ python2 --version
Python 2.7.7
$ python2 -c 'import numpy; print numpy.version.version'
1.8.1

While I can measure a speed-up for the second version (presumably fewer cache misses, because NumPy uses C-style row-major ordering), I don't see the drastic difference reported in the pytables talk.

Also, I don't observe more cache misses for the column summation than for the row summation.
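For reference, the memory-layout argument can be checked directly from NumPy's strides. A minimal sketch, using the same array shape as above:

```python
import numpy as np

# NumPy arrays are C-ordered (row-major) by default: within one row the
# float64 elements are 8 bytes apart, while stepping down one column jumps
# a whole row (4000 * 8 = 32000 bytes), landing on a new cache line each time.
a = np.random.randn(4000, 4000)

print(a.strides)         # (32000, 8)
print(a[1, :].strides)   # (8,)     -> contiguous row slice
print(a[:, 1].strides)   # (32000,) -> strided column slice
```

So a row slice walks memory contiguously, while a column slice touches one element per 32000-byte stride, which is what the cache-miss argument rests on.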


EDIT

  • So far the main insight for me is that I was using the timeit module in the wrong way. Repeated runs on the same array (or the same row/column of an array) will almost certainly be served from cache. I have 32 KiB of L1 data cache, and a full row of float64 values just fits: 4000 * 8 bytes = 31.25 KiB < 32 KiB.

  • Using the script from @alim's answer with a single loop (nloop=1) and ten trials (nrep=10), and varying the size of the random n x n array, I measure:

         n    row/us  col/us  col penalty
        1k      90     100        1
        4k     100     210        2
       10k*    110     350        3.5
       20k*    120    1200       10

    * n=10k and larger no longer fits into the L1d cache.
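The cold-cache measurement described above can be reproduced with timeit.repeat directly. With number=1, the setup statement (and hence the array allocation) re-runs once per trial, so every timed slice reads a freshly created array rather than a cache-warm one. A minimal sketch:

```python
import timeit

# number=1: the setup (array creation) runs again for every trial, so the
# timed slice is not already sitting in cache from a previous iteration.
setup = "import numpy as np; a = np.random.randn(4000, 4000)"
row = min(timeit.repeat("a[1, :].sum()", setup, repeat=10, number=1))
col = min(timeit.repeat("a[:, 1].sum()", setup, repeat=10, number=1))
print("row sum: %.1f us, column sum: %.1f us" % (row * 1e6, col * 1e6))
```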

I'm still not sure how to track down the cause of this, as perf shows roughly the same cache-miss rate (sometimes even a higher one) for the faster row sum.

Perf data:

nloop=2 and nrep=2, so I expect some of the data to still be in the cache for the second run.

Row sum n=10k

 perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/    /g'
row sum:    103.593 us
 Performance counter stats for './answer1.py':
          25850670      cache-references                                             [30.04%]
           1321945      cache-misses              #    5.114 % of all cache refs     [20.04%]
        5706371393      L1-dcache-loads                                              [20.00%]
          11733777      L1-dcache-load-misses     #    0.21% of all L1-dcache hits   [19.97%]
        2401264190      L1-dcache-stores                                             [20.04%]
         131964213      L1-dcache-store-misses                                       [20.03%]
           2007640      L1-dcache-prefetches                                         [20.04%]
       21894150686      cycles                    [20.02%]
       24582770606      instructions              #    1.12  insns per cycle         [30.06%]
        3534308182      branches                                                     [30.01%]
              3767      faults
                 6      migrations
       7.331092823 seconds time elapsed

Column sum n=10k

 perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/    /g'
column sum: 377.059 us
 Performance counter stats for './answer1.py':
          26673628      cache-references                                             [30.02%]
           1409989      cache-misses              #    5.286 % of all cache refs     [20.07%]
        5676222625      L1-dcache-loads                                              [20.06%]
          11050999      L1-dcache-load-misses     #    0.19% of all L1-dcache hits   [19.99%]
        2405281776      L1-dcache-stores                                             [20.01%]
         126425747      L1-dcache-store-misses                                       [20.02%]
           2128076      L1-dcache-prefetches                                         [20.04%]
       21876671763      cycles                    [20.00%]
       24607897857      instructions              #    1.12  insns per cycle         [30.00%]
        3536753654      branches                                                     [29.98%]
              3763      faults
                 9      migrations
       7.327833360 seconds time elapsed

EDIT2: I think I have understood some aspects, but the question has not been fully answered yet. At the moment I believe this summation example doesn't reveal anything about CPU caches at all. To eliminate uncertainty introduced by numpy/python, I tried using perf on a summation written in C; the results are in an answer below.

Asked Jul 14 '14 by Sebastian



2 Answers

I don't see anything wrong with your attempt at replication, but bear in mind that those slides are from 2010, and numpy has changed rather a lot since then. Based on the dates of numpy releases, I would guess that Francesc was probably using v1.5.

Using this script to benchmark row vs. column sums:

#!/usr/bin/env python

import numpy as np
import timeit

print "numpy version == " + str(np.__version__)

setup = "import numpy as np; a = np.random.randn(4000, 4000)"

rsum = "a[1, :].sum()"
csum = "a[:, 1].sum()"

nloop = 1000
nrep = 3

print "row sum:\t%.3f us" % (
    min(timeit.repeat(rsum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
print "column sum:\t%.3f us" % (
    min(timeit.repeat(csum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)

I detect about a 50% slowdown for column sums with numpy v1.5:

$ python sum_benchmark.py 
numpy version == 1.5.0
row sum:        8.472 us
column sum:     12.759 us

Compared with about a 30% slowdown with v1.8.1, which you're using:

$ python sum_benchmark.py 
numpy version == 1.8.1
row sum:        12.108 us
column sum:     15.768 us

It's interesting to note that both types of reduction have actually gotten a bit slower in the more recent numpy versions. I would have to delve a lot deeper into numpy's source code to understand exactly why this is the case.

Update

  • For the record, I'm running Ubuntu 14.04 (kernel v3.13.0-30) on a quad-core i7-2630QM CPU @ 2.0GHz. Both versions of numpy were pip-installed and compiled using GCC-4.8.1.
  • I realize my original benchmarking script wasn't totally self-explanatory - you need to divide the total time by the number of loops (1000) in order to get the time per call.
  • It also probably makes more sense to take the minimum across repeats rather than the average, since this is more likely to represent the lower bound on the execution time (on top of which you'd get variability due to background processes etc.).

I've updated my script and results above accordingly.
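Schematically, the two fixes described above (dividing by the loop count for the per-call time, and taking the minimum across repeats) amount to:

```python
import timeit

nloop, nrep = 1000, 3
# timeit.repeat returns one *total* time per repeat; divide by nloop for the
# per-call time, and take the min as the least-disturbed estimate.
totals = timeit.repeat("sum(range(100))", repeat=nrep, number=nloop)
per_call_us = min(totals) / nloop * 1e6
print("best of %d: %.3f us per call" % (nrep, per_call_us))
```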

We can also negate any effect of caching across calls (temporal locality) by creating a brand-new random array for every call - just set nloop to 1 and nrep to a reasonably small number (unless you really enjoy watching paint dry), say 10.

nloop=1, nrep=10 on a 4000x4000 array:

numpy version == 1.5.0
row sum:        47.922 us
column sum:     103.235 us

numpy version == 1.8.1
row sum:        66.996 us
column sum:     125.885 us

That's a bit more like it, but I still can't really replicate the massive effect shown in Francesc's slides. Perhaps this isn't that surprising, though - the effect may be very compiler-, architecture-, and/or kernel-dependent.

Answered Nov 16 '22 by ali_m


Interesting. I can reproduce Sebastian's performance:

In [21]: np.__version__ 
Out[21]: '1.8.1'

In [22]: a = np.random.randn(4000, 4000)

In [23]: %timeit a[:, 1].sum()
100000 loops, best of 3: 12.4 µs per loop

In [24]: %timeit a[1, :].sum()
100000 loops, best of 3: 10.6 µs per loop

However, if I try with a larger array:

In [25]: a = np.random.randn(10000, 10000)

In [26]: %timeit a[:, 1].sum()
10000 loops, best of 3: 21.8 µs per loop

In [27]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.8 µs per loop

but, if I try again:

In [28]: a = np.random.randn(10000, 10000)

In [29]: %timeit a[:, 1].sum()
10000 loops, best of 3: 64.4 µs per loop

In [30]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.9 µs per loop

So I'm not sure what's going on here, but this jitter is probably due to cache effects. Perhaps newer architectures are smarter at predicting access patterns and hence do better prefetching?

At any rate, for comparison purposes: I am using NumPy 1.8.1 on Ubuntu 14.04, on a laptop with an i5-3380M CPU @ 2.90GHz.

EDIT: After thinking about this a bit: yes, I would say that the first time timeit executes the sum, the column (or row) is fetched from RAM, but the second time the operation runs, the data is in cache (for both the row-wise and column-wise versions), so it executes quickly. As timeit takes the minimum of the runs, this is why we don't see a big difference in times.
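That warm-up effect can be made visible without timeit by timing two consecutive calls by hand. A rough sketch (single measurements, so the numbers will be noisy):

```python
import timeit
import numpy as np

timer = timeit.default_timer  # portable high-resolution wall clock
a = np.random.randn(4000, 4000)

t0 = timer()
a[:, 1].sum()   # first pass: the column's cache lines come from RAM
cold = timer() - t0

t0 = timer()
a[:, 1].sum()   # second pass: those same 4000 lines are (likely) still hot
warm = timer() - t0

print("cold: %.1f us, warm: %.1f us" % (cold * 1e6, warm * 1e6))
```

On most runs the second call should come out noticeably faster, which is exactly the state a min-of-repeats timeit measurement ends up reporting.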

Another question is why we sometimes do see the difference (using timeit). But caches are weird beasts, especially on multicore machines executing multiple processes at a time.

Answered Nov 16 '22 by Francesc