I have a program that does a lot of matrix multiplication. I thought I'd speed it up by reducing the number of loops in the code to see how much faster it would be (I'll try a matrix math library later). It turns out it's not faster at all. I've been able to replicate the problem with some example code. My guess was that testOne() would be faster than testTwo() because it doesn't create any new arrays and because it has a third as many loops. On my machine, its takes twice as long to run:
Duration for testOne with 5000 epochs: 657, loopCount: 64000000
Duration for testTwo with 5000 epochs: 365, loopCount: 192000000
My guess is that multOne()
is slower than multTwo()
because in multOne()
the CPU is not writing to sequential memory addresses like it is in multTwo()
. Does that sound right? Any explanations would be appreciated.
import java.util.Random;
public class ArrayTest {
double[] arrayOne;
double[] arrayTwo;
double[] arrayThree;
double[][] matrix;
double[] input;
int loopCount;
int rows;
int columns;
public ArrayTest(int rows, int columns) {
this.rows = rows;
this.columns = columns;
this.loopCount = 0;
arrayOne = new double[rows];
arrayTwo = new double[rows];
arrayThree = new double[rows];
matrix = new double[rows][columns];
Random random = new Random();
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
matrix[i][j] = random.nextDouble();
}
}
}
public void testOne(double[] input, int epochs) {
this.input = input;
this.loopCount = 0;
long start = System.currentTimeMillis();
long duration;
for (int i = 0; i < epochs; i++) {
multOne();
}
duration = System.currentTimeMillis() - start;
System.out.println("Duration for testOne with " + epochs + " epochs: " + duration + ", loopCount: " + loopCount);
}
public void multOne() {
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
arrayOne[i] += matrix[i][j] * arrayOne[i] * input[j];
arrayTwo[i] += matrix[i][j] * arrayTwo[i] * input[j];
arrayThree[i] += matrix[i][j] * arrayThree[i] * input[j];
loopCount++;
}
}
}
public void testTwo(double[] input, int epochs) {
this.loopCount = 0;
long start = System.currentTimeMillis();
long duration;
for (int i = 0; i < epochs; i++) {
arrayOne = multTwo(matrix, arrayOne, input);
arrayTwo = multTwo(matrix, arrayTwo, input);
arrayThree = multTwo(matrix, arrayThree, input);
}
duration = System.currentTimeMillis() - start;
System.out.println("Duration for testTwo with " + epochs + " epochs: " + duration + ", loopCount: " + loopCount);
}
public double[] multTwo(double[][] matrix, double[] array, double[] input) {
double[] newArray = new double[rows];
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
newArray[i] += matrix[i][j] * array[i] * input[j];
loopCount++;
}
}
return newArray;
}
public static void main(String[] args) {
int rows = 100;
int columns = 128;
ArrayTest arrayTest = new ArrayTest(rows, columns);
Random random = new Random();
double[] input = new double[columns];
for (int i = 0; i < columns; i++) {
input[i] = random.nextDouble();
}
arrayTest.testOne(input, 5000);
arrayTest.testTwo(input, 5000);
}
}
There is a simple reason why your tests take different time: they don't do the same thing. Since the two loops you compare are not functionally identical, the number of iterations is not a good metric to look at.
testOne
takes longer than testTwo
because:
In multOne
you update arrayOne[i]
in place, during each iteration
of the j
loop. This means for each iteration of the inner loop j
you are using a new value of arrayOne[i]
, computed in the
previous iteration. This creates a loop carried dependency, which is
harder to optimise for the compiler, because you require the output
of the operation matrix[i][j] * arrayOne[i] * input[j]
on the next
CPU clock cycle. This is not really possible with floating point
operations, which have a latency of a few clock cycles usually, so
it results in stalls, therefore reduced performance.
In testTwo
you
update arrayOne
only once per each iteration of the epoch
, and
since there are no carried dependecies, the loop can be vectorised
efficiently, which results in better cache and arithmetic
performance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With