Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is multiplication faster than array access?

To my surprise I get a longer time (10 milliseconds) when "optimizing" multiplications by pregenerating the results in an array compared to the original 8 milliseconds. Is that just a Java quirk or is that general of the PC architecture? I have a Core i5 760 with Java 7, Windows 8 64 Bit.

public class Test {
    public static void main(String[] args)  {
        long start = System.currentTimeMillis();
        long sum=0;
        int[] sqr = new int[1000];
        for(int a=1;a<1000;a++) {sqr[a]=a*a;}

        for(int b=1;b<1000;b++)
//          for(int a=1;a<1000;a++) {sum+=a*a+b*b;}
            for(int a=1;a<1000;a++) {sum+=sqr[a]+sqr[b];}
        System.out.println(System.currentTimeMillis()-start+"ms");
        System.out.println(sum);
    }
}
like image 393
Konrad Höffner Avatar asked Feb 03 '14 23:02

Konrad Höffner


2 Answers

Konrad Rudolph commented on the issues with the benchmarking. So I am ignoring the benchmark and focus on the question:

Is multiplication faster than array access?

Yes, it is very likely. It used to be the other way around 20 or 30 years ago.

Roughly speaking, you can do an integer multiplication in 3 cycles (pessimistic, if you don't get vector instructions), and a memory access costs you 4 cycles if you get it straight from the L1 cache but it is straight downhill from there. For reference, see

  • Latencies and throughput in Appendix C of the Intel 64 and IA-32 Architectures Optimization Reference Manual

  • Approximate cost to access various caches and main memory?

  • Herb Sutter's presentation on this very subject: Machine Architecture: Things Your Programming Language Never Told You


One thing specific to Java was pointed out by Ingo in a comment below: You also get bounds checking in Java, which makes the already slower array access even slower...

like image 109
Ali Avatar answered Oct 23 '22 19:10

Ali


A more reasonable benchmark would be:

public abstract class Benchmark {

    final String name;

    public Benchmark(String name) {
        this.name = name;
    }

    abstract int run(int iterations) throws Throwable;

    private BigDecimal time() {
        try {
            int nextI = 1;
            int i;
            long duration;
            do {
                i = nextI;
                long start = System.nanoTime();
                run(i);
                duration = System.nanoTime() - start;
                nextI = (i << 1) | 1;
            } while (duration < 1000000000 && nextI > 0);
            return new BigDecimal((duration) * 1000 / i).movePointLeft(3);
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public String toString() {
        return name + "\t" + time() + " ns";
    }

    private static void shuffle(int[] a) {
        Random chaos = new Random();
        for (int i = a.length; i > 0; i--) {
            int r = chaos.nextInt(i);
            int t = a[r];
            a[r] = a[i - 1];
            a[i - 1] = t;
        }
    }


    public static void main(String[] args) throws Exception {
        final int[] table = new int[1000];
        final int[] permutation = new int[1000];

        for (int i = 0; i < table.length; i++) {
            table[i] = i * i;
            permutation[i] = i;
        }
        shuffle(permutation);

        Benchmark[] marks = {
            new Benchmark("sequential multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += i * i;
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("sequential lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += permutation[i] * permutation[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[permutation[i]];
                        }
                    }
                    return sum;
                }
            }
        };

        for (Benchmark mark : marks) {
            System.out.println(mark);
        }
    }
}

which prints on my intel core duo (yes, it's old):

sequential multiply    2218.666 ns
sequential lookup      1081.220 ns
random order multiply  2416.923 ns
random order lookup    2351.293 ns

So, if I access the lookup array sequentially, which minimizes the number of cache misses, and permits the hotspot JVM to optimize bounds checking on array access, there is a slight improvement on an array of 1000 elements. If we do random access into the array, that advantage disappears. Also, if the table is larger, the lookup gets slower. For instance, for 10000 elements, I get:

sequential multiply    23192.236 ns
sequential lookup      12701.695 ns
random order multiply  24459.697 ns
random order lookup    31595.523 ns

So, array lookup is not faster than multiplication, unless the access pattern is (nearly) sequential and the lookup array small.

In any case, my measurements indicate that a multiplication (and addition) takes merely 4 processor cycles (2.3 ns per loop iteration on a 2GHz CPU). You're unlikely to get much faster than that. Also, unless you do half a billion multiplications per second, the multiplications are not your bottleneck, and optimizing other parts of the code will be more fruitful.

like image 44
meriton Avatar answered Oct 23 '22 19:10

meriton