I'm writing a multi threading java application that runs on Nehalem processor. However I have a problem that starting from 4 threads I almost don't see the speedup in my application.
I've made some simple test. I've created a thread that just allocates a big array and making access to random entries in the array. So when I run number of threads the running time shouldn't change (assuming I'm not exceeding number of available CPU cores). But what I observed is that running 1 or 2 threads takes almost the same time, but running 4 or 8 threads is significantly slower. So before trying so solve algorithmic and synchronization problem in my application I want to find out what is maximal possible parallelization I can achieve.
I've used -XX:+UseNUMA
JVM option, so the arrays ought to be allocated in the memory near the corresponding threads.
P.S. If the threads were making a simple mathematical calculation there was no time drop for 4 and even 8 threads, so I concluded that when the threads are accessing memory I have some problems.
Any help or ideas are appreciated, thanks.
EDIT
Thanks you all for the replies. I see that I haven't explained myself good enough.
Before trying to eliminate synchronization problems in my application I made a simple test that check best possible parallelization that could be achieved. The code is as follows:
public class TestMultiThreadingArrayAccess {
private final static int arrSize = 40000000;
private class SimpleLoop extends Thread {
public void run() {
int array[] = new int[arrSize];
for (long i = 0; i < arrSize * 10; i++) {
array[(int) ((i * i) % arrSize)]++; // randomize a bit the access to the array
}
long sum = 0;
for (int i = 0; i < arrSize; i++)
sum += array[i];
}
}
public static void main(String[] args) {
TestMultiThreadingArrayAccess test = new TestMultiThreadingArrayAccess();
for (int threadsNumber : new int[] { 1, 2, 4, 8 }) {
Statistics timer = new Statistics("Executing " + threadsNumber+ " threads"); // Statistics is a simple helper class that measures the times
timer.start();
test.doTest(threadsNumber);
timer.stop();
System.out.println(timer.toString());
}
}
public void doTest(int threadsNumber) {
Thread threads[] = new Thread[threadsNumber];
for (int i = 0; i < threads.length; i++) {
threads[i] = new SimpleLoop();
threads[i].start();
}
for (int i = 0; i < threads.length; i++)
try {
threads[i].join();
} catch (InterruptedException e) {
};
}
}
So as you see there is no synchronization at all in this minitest and also the allocation of the array is inside the thread so it should be placed in the chunk of the memory that could be accessed quickly. Also there is no memory contentions in this code. Still for 4 threads there is a drop of 30% in the running time, and 8 threads runs twice slower. As you from the code I just wait until all threads finish theirs work, and since theirs work is independent number of threads shouldn't affect the total time the execution takes.
On the machine installed 2 quad-core hyperthreaded Nehalem processors (totally 16 CPUs), so with 8 threads each one could catch it's CPU exclusively.
When I tried to run this test with smaller array (20K entries) the drop of the execution time of 4 threads was 7% and 8 threads - 14%, which is satisfying. But when I try to operate random accessed on large array (40M entries) running times increase dramatically, so I think that there is problem that big chunks of memory (because they don't fit in the cache memory?) are accessed in a non-efficient way.
Are there any ideas how to fix this?
Hope this clarifies the question in a better way, thanks again.
The bottleneck in the test is the cpu to memory bandwith. Even when local memory is available, it is going to be shared by some number of threads. (The memory is local to a node, not to a specific core.) Once CPU can easily exceed the available bandwidth for a simple loop like your above test, and so increasing threads on such a test will not improve performance, and can worsen performance due to worsened cache coherence.
Just a sanity test, are you also using the parallel collector? -XX:+UseParallelGC
. UseNUMA takes effect only then.
Without knowing what exactly you are doing and what is the problem your trying to solve. It looks like you have heavy synchronization around your code, since it could be the main reason for not to be scalable enough. Over synchronization cause to slow down any speedup, once it make your application almost serial. So my suggestion to you is to inspect your implementation and trying to figure this out.
ADD.
After you've added your implementation of what you are doing. The downgrade of performance could be explained by large and massive memory access. Once you running all you thread and they need to access memory controller for not cached data, since they running on different CPU's, memory controller prevents CPU's from doing it simultaneously, meaning there is a synchronization at hardware level on each cache miss. In you case it's almost equal as if you were running 10 different independent programs. I guess if you will launch 10 (you can replace 10 by any large number) copies your web browser, for instance, you will see same effect, but this doesn't mean that browser implementation is ineffective, you just create a huge burden on computer memory.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With