To my surprise, I get a longer time (10 milliseconds) when "optimizing" the multiplications by precomputing the results in an array, compared to the original 8 milliseconds. Is that just a Java quirk, or is it a general property of the PC architecture? I have a Core i5 760 with Java 7 on Windows 8, 64-bit.
public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        long sum = 0;
        int[] sqr = new int[1000];
        for (int a = 1; a < 1000; a++) { sqr[a] = a * a; }
        for (int b = 1; b < 1000; b++)
            // for (int a = 1; a < 1000; a++) { sum += a * a + b * b; }
            for (int a = 1; a < 1000; a++) { sum += sqr[a] + sqr[b]; }
        System.out.println(System.currentTimeMillis() - start + "ms");
        System.out.println(sum);
    }
}
Konrad Rudolph already commented on the problems with the benchmarking, so I will ignore the benchmark itself and focus on the question:
Is multiplication faster than array access?
Yes, very likely. It used to be the other way around 20 or 30 years ago.
Roughly speaking, an integer multiplication costs about 3 cycles (pessimistically, if you don't get vector instructions), while a memory access costs 4 cycles even if it comes straight from the L1 cache, and it is straight downhill from there. For reference, see:
Latencies and throughput in Appendix C of the Intel 64 and IA-32 Architectures Optimization Reference Manual
Approximate cost to access various caches and main memory?
Herb Sutter's presentation on this very subject: Machine Architecture: Things Your Programming Language Never Told You
One thing specific to Java was pointed out by Ingo in a comment below: You also get bounds checking in Java, which makes the already slower array access even slower...
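To illustrate the bounds-checking point: HotSpot can typically eliminate the check when the index is provably in range (e.g. a counter bounded by `a.length`), but not when the index comes from data. This is a sketch under that assumption; the class and method names are mine, and whether elimination actually happens depends on the JIT:

```java
public class BoundsCheckDemo {

    // Index is the loop counter, bounded by a.length: HotSpot can usually
    // prove 0 <= i < a.length and drop the per-access bounds check.
    static long sumDirect(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Index is loaded from another array: the JIT cannot prove the value
    // is in range, so each access generally keeps its bounds check.
    static long sumIndirect(int[] a, int[] idx) {
        long sum = 0;
        for (int i = 0; i < idx.length; i++) {
            sum += a[idx[i]];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[1000];
        int[] idx = new int[1000];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            idx[i] = a.length - 1 - i; // reversed order, but always in range
        }
        System.out.println(sumDirect(a));        // 499500
        System.out.println(sumIndirect(a, idx)); // 499500
    }
}
```

Both loops compute the same sum; the difference is only in what the JIT can prove about the index.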
A more reasonable benchmark would be:
import java.math.BigDecimal;
import java.util.Random;

public abstract class Benchmark {

    final String name;

    public Benchmark(String name) {
        this.name = name;
    }

    abstract int run(int iterations) throws Throwable;

    private BigDecimal time() {
        try {
            int nextI = 1;
            int i;
            long duration;
            do {
                i = nextI;
                long start = System.nanoTime();
                run(i);
                duration = System.nanoTime() - start;
                nextI = (i << 1) | 1;
            } while (duration < 1000000000 && nextI > 0);
            return new BigDecimal((duration) * 1000 / i).movePointLeft(3);
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public String toString() {
        return name + "\t" + time() + " ns";
    }

    private static void shuffle(int[] a) {
        Random chaos = new Random();
        for (int i = a.length; i > 0; i--) {
            int r = chaos.nextInt(i);
            int t = a[r];
            a[r] = a[i - 1];
            a[i - 1] = t;
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] table = new int[1000];
        final int[] permutation = new int[1000];
        for (int i = 0; i < table.length; i++) {
            table[i] = i * i;
            permutation[i] = i;
        }
        shuffle(permutation);

        Benchmark[] marks = {
            new Benchmark("sequential multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += i * i;
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("sequential lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += permutation[i] * permutation[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[permutation[i]];
                        }
                    }
                    return sum;
                }
            }
        };

        for (Benchmark mark : marks) {
            System.out.println(mark);
        }
    }
}
which prints on my Intel Core Duo (yes, it's old):
sequential multiply 2218.666 ns
sequential lookup 1081.220 ns
random order multiply 2416.923 ns
random order lookup 2351.293 ns
So, if I access the lookup array sequentially, which minimizes the number of cache misses and lets the HotSpot JVM optimize away the bounds checks on array access, there is a modest improvement on an array of 1000 elements. If we do random access into the array, that advantage disappears. Also, the larger the table, the slower the lookup. For instance, with 10000 elements I get:
sequential multiply 23192.236 ns
sequential lookup 12701.695 ns
random order multiply 24459.697 ns
random order lookup 31595.523 ns
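Those size numbers line up with typical cache capacities. A quick back-of-the-envelope check (the 32 KB L1 data cache size is my assumption, typical for Intel cores of that era, not something measured here):

```java
public class WorkingSet {
    public static void main(String[] args) {
        final int BYTES_PER_INT = 4;          // size of a Java int
        final int L1_DATA_CACHE = 32 * 1024;  // 32 KB, assumed typical L1d size

        int small = 1000 * BYTES_PER_INT;     // 4000 bytes
        int large = 10000 * BYTES_PER_INT;    // 40000 bytes

        System.out.println("1000-entry table:  " + small + " B, fits in L1: "
                + (small <= L1_DATA_CACHE));  // true
        System.out.println("10000-entry table: " + large + " B, fits in L1: "
                + (large <= L1_DATA_CACHE));  // false
    }
}
```

The 1000-entry table fits comfortably in L1, so even random lookups stay cheap; the 10000-entry table spills into L2, which is why "random order lookup" falls behind the multiplication there.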
So, array lookup is not faster than multiplication, unless the access pattern is (nearly) sequential and the lookup array small.
In any case, my measurements indicate that a multiplication (and addition) takes merely 4 processor cycles (2.3 ns per loop iteration on a 2GHz CPU). You're unlikely to get much faster than that. Also, unless you do half a billion multiplications per second, the multiplications are not your bottleneck, and optimizing other parts of the code will be more fruitful.
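That per-iteration estimate is just cycles divided by clock rate. A sanity-check sketch (the 2 GHz clock and 4-cycle figures are the ones quoted above, not fresh measurements):

```java
public class CycleCost {
    public static void main(String[] args) {
        double clockHz = 2.0e9;            // 2 GHz CPU, as quoted above
        double cyclesPerIteration = 4.0;   // multiply + add, estimated above

        // time per iteration in nanoseconds
        double nsPerIteration = cyclesPerIteration / clockHz * 1e9;
        System.out.println(nsPerIteration + " ns per iteration"); // 2.0

        // measured: 2218.666 ns for a 1000-iteration inner loop
        double measuredNs = 2218.666 / 1000;
        System.out.println(measuredNs + " ns measured"); // ~2.22, consistent
    }
}
```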