Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multithreaded matrix multiplication [closed]

I recently started lerning multithreading in java. Since I'm writing a program for numerical calculations at my University, i decided to make some first attempts by programming multithreaded matrix multiplication.

This is my code. Please keep in mind that this was just made as a first attempt and is not very clean.

    public class MultithreadingTest{

        public static void main(String[] args) {
            // TODO Auto-generated method stub
            double[][] matrix1 = randomSquareMatrix(2000);
            double[][] matrix2 = randomSquareMatrix(2000);

            matrixMultiplication(matrix1,matrix2,true);
            matrixMultiplicationSingleThread(matrix1, matrix2);
            try {
                matrixMultiplicationParallel(matrix1,matrix2, true);
            } catch (InterruptedException | ExecutionException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            try {
                matrixMultiplicationParallel2(matrix1,matrix2, true);
            } catch (InterruptedException | ExecutionException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }

        }

        public static double[][] randomSquareMatrix(int n){
            double[][] mat = new double[n][n];
            Random rand = new Random();
            for(int i=0; i<n; i++) for(int j=0; j<n; j++) mat[i][j]=rand.nextInt(10);
            return mat;
        }
        public static void printSquareMat(double[][] mat){
            int n=mat.length;
            for(int i=0; i<n; i++){ for(int j=0; j<n; j++) System.out.print(mat[i][j]+" "); System.out.print("\n");}
            System.out.print("\n");
        }

        public static void average(double[][] matrix)
        {
            int n=matrix.length;
            double sum=0;
            for(int i=0; i<n; i++) for(int j=0; j<n; j++) sum+=matrix[i][j];

            System.out.println("Average of all Elements of Matrix : "+(sum/(n*n)));
        }

        public static void matrixMultiplication(double[][] matrix1, double[][] matrix2, boolean printMatrix){

            int n=matrix1.length;
            double[][] resultMatrix = new double[n][n];

            double startTime = System.currentTimeMillis();

            for(int i=0; i<n; i++)for(int j=0; j<n; j++)for(int k=0; k<n; k++) resultMatrix[i][j]+=matrix1[i][k]*matrix2[k][j];


            if (printMatrix && n<=5)for(int i=0; i<n; i++){for(int j=0; j<n; j++) System.out.print(resultMatrix[i][j]+" ");System.out.print("\n"); }

            System.out.print("\n");
            System.out.println(((System.currentTimeMillis()-startTime)/1000)+
                    " seconds for matrix of size "+n+" in main thread.");
            average(resultMatrix);
        }

        public static void matrixMultiplicationSingleThread(double[][] m1, double[][] m2)
        {
            int n=m1.length;
            double startTime = System.currentTimeMillis();
            Thread t = new Thread(new multiSingle(m1,m2));
            t.start();
            try {
                t.join();
            } catch (InterruptedException e) {
                // TODO Auto-generated catch block
                System.out.println("Error");
                e.printStackTrace();
            }
            System.out.print("\n");
            System.out.println(((System.currentTimeMillis()-startTime)/1000)+
                    " seconds for matrix of size "+n+" in external Thread.");

        }

        public static void matrixMultiplicationParallel(double[][] matrix1, double[][] matrix2, boolean printMatrix) throws InterruptedException, ExecutionException{

            int n=matrix1.length;
            double[][] resultMatrix=new double[n][n];
            double tmp;
            ExecutorService exe = Executors.newFixedThreadPool(2);
            Future<Double>[][] result = new Future[n][n];
            double startTime = System.currentTimeMillis();
            for(int i=0; i<n; i++)
            {
                for(int j=0; j<=i; j++)
                {
                    tmp=matrix2[i][j];
                    matrix2[i][j]=matrix2[j][i];
                    matrix2[j][i]=tmp;
                }
            }

            for(int i=0; i<n; i++)
            {
                for(int j=0; j<n; j++)
                {
                    result[i][j] = exe.submit(new multi(matrix1[i],matrix2[j]));
                }
            }

            exe.shutdown();
            exe.awaitTermination(1, TimeUnit.DAYS);

            for(int i=0; i<n; i++)
            {
                for(int j=0; j<n; j++)
                {
                    resultMatrix[i][j] = result[i][j].get();
                }
            }
            for(int i=0; i<n; i++)
            {
                for(int j=0; j<=i; j++)
                {
                    tmp=matrix2[i][j];
                    matrix2[i][j]=matrix2[j][i];
                    matrix2[j][i]=tmp;
                }
            }
            if (printMatrix && n<=5)for(int i=0; i<n; i++){for(int j=0; j<n; j++) System.out.print(resultMatrix[i][j]+" ");System.out.print("\n"); }

            System.out.print("\n");
            System.out.println(((System.currentTimeMillis()-startTime)/1000)+
                    " seconds for matrix of size "+n+" multithreaded with algorithm 1.");
            average(resultMatrix);
        }

        public static void matrixMultiplicationParallel2(double[][] matrix1, double[][] matrix2, boolean printMatrix) throws InterruptedException, ExecutionException{

            int n=matrix1.length;
            double[][] resultMatrix=new double[n][n];
            double tmp;
            ExecutorService exe = Executors.newFixedThreadPool(2);
            Future<Double>[][] result = new Future[n][n];
            double startTime = System.currentTimeMillis();


            for(int i=0; i<n; i++)
            {
                for(int j=0; j<n; j++)
                {
                    result[i][j] = exe.submit(new multi2(i,j,matrix1,matrix2));
                }
            }

            exe.shutdown();

            exe.awaitTermination(1, TimeUnit.DAYS);


            for(int i=0; i<n; i++)
            {
                for(int j=0; j<n; j++)
                {
                    resultMatrix[i][j] = result[i][j].get();
                }
            }

            if (printMatrix && n<=5)for(int i=0; i<n; i++){for(int j=0; j<n; j++) System.out.print(resultMatrix[i][j]+" ");System.out.print("\n"); }

            System.out.print("\n");
            System.out.println(((System.currentTimeMillis()-startTime)/1000)+
                    " seconds for matrix of size "+n+" multithreaded with algorithm 2.");
            average(resultMatrix);
        }

        public static class multi implements Callable<Double>{

            multi(double[] vec1, double[] vec2){
                this.vec1=vec1; this.vec2=vec2;
            }
            double result;
            double[] vec1, vec2;

            @Override
            public Double call() {
                result=0;
                for(int i=0; i<vec1.length; i++) result+=vec1[i]*vec2[i];
                return result;
            }
        }

        public static class multi2 implements Callable<Double>{

            multi2(int a, int b, double[][] vec1, double[][] vec2){
                this.a=a; this.b=b; this.vec1=vec1; this.vec2=vec2;
            }
            int a,b;
            double result;
            double[][] vec1, vec2;

            @Override
            public Double call() {
                result=0;
                for(int i=0; i<vec1.length; i++) result+=vec1[a][i]*vec2[i][b];
                return result;
            }
        }

        public static class multiSingle implements Runnable{

            double[][] matrix1, matrix2;

            multiSingle(double[][] m1, double[][] m2){
                matrix1=m1;
                matrix2=m2;
            }
            public static void matrixMultiplication(double[][] matrix1, double[][] matrix2, boolean printMatrix){

                int n=matrix1.length;
                double[][] resultMatrix = new double[n][n];

                for(int i=0; i<n; i++)for(int j=0; j<n; j++)for(int k=0; k<n; k++) resultMatrix[i][j]+=matrix1[i][k]*matrix2[k][j];

                MultithreadingTest.average(resultMatrix);
            }

            @Override
            public void run() {
                matrixMultiplication(matrix1, matrix2, false);
            }
        }

    }

I have two general questions to multithreading, i hope it's ok not opening a new topic for this.

  1. Is there a way to write the code without additional classes for the threads that implement runnable or callable? I looked at approaches using anonymous inner classes and lambdas, but as far as I have fount information I can't pass parameters to the threads this way since run() and call() don't take any, that is, unless the parameters are final. But assuming I write a class for matrix operations, I would prefer not to write an additinal class for every operation I want to run in a thread.
  2. Assuming my class does many multithreaded operations, creating a new thread pool and shutting it down in every method would waste a lot of resources, I guess. So I would like to create a thread pool as member ob my class, instantiating it when needed and using invokeAll. But what happens if my object is deleted? Will I get problems since I never shut down the thread pool? In C++ I would use a destructor for this. Or does the gc take care of everything in this case?

Now concering my code directly:

I implemented matrix multiplication in four different ways, as method running in my main thread, as method running in a new thread, but still not multithreaded ( to ensure there would not be any background taks in my main thread slowing it down ) , and two different ways of multithreaded matrix multiplication. The first version transposes the second matrix, submits the multiplication as vector-vector-multiplication and transposes the matrix back to its original form. The second version takes the matrices directly and additinaly takes two indices to define the row and column of the matrices for vector-vector-multiplication.

For all versions I measured the time needed for multiplication and calculated the average value af the resulting matrices to see whether the reults are the same.

I've run this code on two computers, both the same JVM and both Windows 10. The first is my Laptop, i5 5th generation, 2,6 Ghz dual core, and the second one in my Desktop PC, i5 4th generation, 4,2 Ghz quad core.

I expected my Desktop-PC to be much faster. I also expected the multithreaded versions to take about half/quarter the time of the signle threaded version, but still more since there is additional work for creating the threads etc. And last, i expected the second multithreaded version, which does not transpose one matrix twice, to be faster, since there are less operations.

After running the code I'm a little confused about the results, i hope someone can explain it to me:

For both single threaded methods my Laptop needs roughly 340s (for a matrix size of 3000). So i assume there are no expensive background tasks done in my main thread. My Desktop Pc on the other hand need 440s. Now the question is, why is my Laptop, which is defenitely slower, that much faster? Even if the fifth generation is faster than the 4th generation, since my Desktop PC runs at 1.6 times my Laptop's speed i would still expect it to be faster. The difference between those generations is unlikely that big.

For the multithreaded methods my Laptop need roughly 34s. If the multithreading would be perfect, then it should not take less than half. Why is it on two threads ten times faster? The same goes for my Desktop PC. Using four threads the multiplication is done in 16s instead of 440s. This is like my Desktop Pc is working with the same speed as my Laptop, just on four instead of two threads.

Now for the comparison between the two multithreaded methods, the version which transposes one matrix twice takes roughly 34s on my Laptop, the version which takes the matrices directly takes roughly 200s. That sounds realistic, since it's more than half of the single threaded method. But why is it that much slower than the first version? I would assume that transposing a matrix twice would be slower than the additional time to get the matrix elements? Is there something i'm missing or is working with a matrix really that much slower than working with a vector?

I hope someone can answer these questions. Sorry for writing such a long post.

Yours sincerely Thorsten

like image 356
Thorsten Schmitz Avatar asked Sep 25 '22 09:09

Thorsten Schmitz


1 Answers

The answer to the big mystery this this: The time required to do the matrix multiplication is dominated by the time spent moving data from RAM to the CPU cache. You may have 4 cores, but you only have 1 RAM bus, so you won't get any benefit by using more cores (multithreading) if they all block each other waiting for memory access.

The first experiment you should try is this: Write the single-threaded version using the matrix transpose and vector multiplication. You will find that it is MUCH faster -- probably about as fast as the multithreaded version with the transpose.

The reason why the original single-threaded version is so slow, is because it has to load a cache block for every cell in the column being multiplied. If you use the matrix transpose, then all those cells are sequential in memory, and loading one block gets you a bunch of them.

So, if you want to optimize matrix multiplication, FIRST optimize memory access for cache efficiency, THEN divide the work up between a few threads -- no more than twice as many threads as you have cores. Anything more just wastes time and resources with context switches, etc.

regarding your other questions:

1) It's convenient to use lambdas that capture variables from the scope that creates them, like:

for(int i=0; i<n; i++)
{
    for(int j=0; j<n; j++)
    {
        final double[] v1 = matrix1[i];
        final double[] v2 = matrix2[j];
        result[i][j] = exe.submit(() -> vecdot(v1,v2));
    }
}

2) The GC will take care of it. You don't need to explicitly shut down the thread pool to free any resources.

like image 57
Matt Timmermans Avatar answered Sep 28 '22 04:09

Matt Timmermans