
Why Is Java Not Utilising All My CPU Cores Effectively [duplicate]

I am running Ubuntu on a machine with a quad-core CPU. I have written some test Java code that spawns a given number of threads, each of which simply increments a volatile variable for a given number of iterations.

I would expect the running time not to increase significantly while the number of threads is less than or equal to the number of cores, i.e. 4. In fact, these are the times I get using the "real" time from the UNIX time command:

1 thread: 1.005s

2 threads: 1.018s

3 threads: 1.528s

4 threads: 1.982s

5 threads: 2.479s

6 threads: 2.934s

7 threads: 3.356s

8 threads: 3.793s

This shows that adding one extra thread does not increase the time, as expected, but the time then does increase with 3 and 4 threads.

At first I thought this could be because the OS was preventing the JVM from using all the cores, but I ran top, and it clearly showed that with 3 threads, 3 cores were running at ~100%, and with 4 threads, 4 cores were maxed out.

My question is: since the code is running in parallel on all the cores, why is it not roughly as fast on 3 or 4 CPUs as it is on 1 or 2?

Here is my code for reference:

class Example implements Runnable {

    // using this so the compiler does not optimise the computation away
    volatile int temp;

    void delay(int arg) {
        for (int i = 0; i < arg; i++) {
            for (int j = 0; j < 1000000; j++) {
                this.temp += i + j;
            }
        }
    }

    int arg;
    int result;

    Example(int arg) {
        this.arg = arg;
    }

    public void run() {
        delay(arg);
        result = 42;
    }

    public static void main(String args[]) {

        // Get the number of threads (the command line arg)

        int numThreads = 1;
        if (args.length > 0) {
            try {
                numThreads = Integer.parseInt(args[0]);
            } catch (NumberFormatException nfe) {
                System.out.println("First arg must be the number of threads!");
            }
        }

        // Start up the threads

        Thread[] threadList = new Thread[numThreads];
        Example[] exampleList = new Example[numThreads];
        for (int i = 0; i < numThreads; i++) {
            exampleList[i] = new Example(1000);
            threadList[i] = new Thread(exampleList[i]);
            threadList[i].start();
        }

        // wait for the threads to finish

        for (int i = 0; i < numThreads; i++) {
            try {
                threadList[i].join();
                System.out.println("Joined with thread, ret=" + exampleList[i].result);
            } catch (InterruptedException ie) {
                System.out.println("Caught " + ie);
            }
        }
    }
}
Will Sewell asked Dec 27 '13 22:12


1 Answer

Using multiple CPUs helps up to the point you saturate some underlying resource.

In your case, the underlying resource is not the number of CPUs but the number of L1 caches you have. It appears you have two cores, each with its own L1 data cache, and since you are hitting it with a volatile write on every iteration, the L1 caches are your limiting factor here.

Try accessing the L1 cache less, by accumulating into a local variable and writing to the volatile field only once per outer iteration:
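To check what the JVM itself sees, a minimal sketch such as the following can help. The class name CoreCount is only an illustration. Note that availableProcessors() reports logical processors (hardware threads), so on a hyperthreaded machine it may be twice the number of physical cores, and two hardware threads on the same core share that core's L1 cache. On Linux, lscpu shows the cores and threads per core separately.

```java
public class CoreCount {

    // Number of logical processors available to this JVM.
    // This counts hardware threads, not physical cores or L1 caches.
    static int logicalProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Logical processors visible to the JVM: " + logicalProcessors());
    }
}
```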

public class Example implements Runnable {
    // using this so the compiler does not optimise the computation away
    volatile int temp;

    void delay(int arg) {
        for (int i = 0; i < arg; i++) {
            int temp = 0;
            for (int j = 0; j < 1000000; j++) {
                temp += i + j;
            }
            this.temp += temp;
        }
    }

    int arg;
    int result;

    Example(int arg) {
        this.arg = arg;
    }

    public void run() {
        delay(arg);
        result = 42;
    }

    public static void main(String... ignored) {

        int MAX_THREADS = Integer.getInteger("max.threads", 8);
        long[] times = new long[MAX_THREADS + 1];
        for (int numThreads = MAX_THREADS; numThreads >= 1; numThreads--) {
            long start = System.nanoTime();

            // Start up the threads

            Thread[] threadList = new Thread[numThreads];
            Example[] exampleList = new Example[numThreads];
            for (int i = 0; i < numThreads; i++) {
                exampleList[i] = new Example(1000);
                threadList[i] = new Thread(exampleList[i]);
                threadList[i].start();
            }

            // wait for the threads to finish

            for (int i = 0; i < numThreads; i++) {
                try {
                    threadList[i].join();
                    System.out.println("Joined with thread, ret=" + exampleList[i].result);
                } catch (InterruptedException ie) {
                    System.out.println("Caught " + ie);
                }
            }
            long time = System.nanoTime() - start;
            times[numThreads] = time;
            System.out.printf("%d: %.1f ms%n", numThreads, time / 1e6);
        }
        for (int i = 2; i <= MAX_THREADS; i++)
            System.out.printf("%d: %.3f time %n", i, (double) times[i] / times[1]);
    }
}

On my dual-core, hyperthreaded laptop it produces the following, in the form threads: factor, where factor is the running time relative to a single thread:

2: 1.093 time 
3: 1.180 time 
4: 1.244 time 
5: 1.759 time 
6: 1.915 time 
7: 2.154 time 
8: 2.412 time 

compared with the original test, which gives:

2: 1.092 time 
3: 2.198 time 
4: 3.349 time 
5: 3.079 time 
6: 3.556 time 
7: 4.183 time 
8: 4.902 time 

A common resource to over-utilise is the L3 cache. It is shared across CPUs and, while it allows a degree of concurrency, it doesn't scale well above two CPUs. I suggest you check what your Example code is doing and make sure the threads can run independently without using any shared resources; for example, most chips have a limited number of FPUs.
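As one way to follow the advice above, here is a hedged sketch in which each task keeps its running total in a local variable and only hands back the result when it finishes, so the hot loop touches no volatile or shared state. The class and method names (IndependentTasks, work, runAll) are made up for this illustration, and it uses an ExecutorService rather than the raw threads of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IndependentTasks {

    // Same arithmetic as delay(), but accumulated in a local variable.
    // int overflow wraps around, as it does in the original example.
    static int work(int outer, int inner) {
        int total = 0;
        for (int i = 0; i < outer; i++) {
            for (int j = 0; j < inner; j++) {
                total += i + j;
            }
        }
        return total;
    }

    // Runs numThreads independent copies of work() and collects their results.
    static List<Integer> runAll(int numThreads, int outer, int inner) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int t = 0; t < numThreads; t++) {
                Callable<Integer> task = () -> work(outer, inner);
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                try {
                    results.add(f.get()); // blocks until that task is done
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int n = Runtime.getRuntime().availableProcessors();
        long start = System.nanoTime();
        List<Integer> results = runAll(n, 1000, 1_000_000);
        double ms = (System.nanoTime() - start) / 1e6;
        System.out.printf("%d threads: %.1f ms, results=%s%n", n, ms, results);
    }
}
```

Because the JIT compiler is free to optimise a loop over a local variable heavily, timings from a sketch like this say more about scaling across threads than about the absolute cost of the arithmetic.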

Peter Lawrey answered Sep 23 '22 08:09