Problem
I have a piece of Java code (JDK 1.6.0_22, if relevant) that implements a stateless, side-effect-free function with no mutexes. It does, however, use a lot of memory (I don't know if that is relevant).
In the past I visited Sun Laboratories and gathered the standard "performance vs. number of threads" curve. As this function has no mutexes, it produced a nice graph, although garbage collection kicked in as the number of threads increased. After some garbage collection tuning I was able to make the curve almost flat.
I am now running the same experiment on Intel hardware. The machine has 4 CPUs, each with 8 cores, plus hyperthreading, which gives 64 availableProcessors(). Unfortunately the "performance vs. number of threads" curve scales nicely for 1, 2, and 3 threads, then caps at 3 threads. Beyond 3 threads I can throw as many threads as I want at the task, and the performance gets no better.
Attempts to fix the Problem
My first thought was that I had been stupid and introduced some synchronised code somewhere. Normally to resolve this issue I run JConsole or JVisualVM and look at the thread stack traces. If I have 64 threads running at the speed of 3, I would expect 61 of them to be sitting waiting to enter a mutex. I didn't find this. Instead I found all the threads running: just very slowly.
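For reference, the same check can be done from the command line with jstack, which ships with the JDK (the pid placeholder is whatever jps reports for the benchmark process):

    jstack <pid>

Threads stuck on a lock show up with java.lang.Thread.State: BLOCKED (on object monitor) and a "waiting to lock" line in their stack trace; that is the signature that was absent here.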
A second thought was that perhaps the timing framework was introducing problems. I replaced my function with a dummy function that just counts to a billion using an AtomicLong. This scaled beautifully with the number of threads: I was able to count to a billion 10,000 times 64 times quicker with 64 threads than with 1 thread.
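For illustration, a minimal sketch of this kind of dummy workload; the class name and harness details are made up, and each thread uses its own private AtomicLong, so there is no shared state:

    import java.util.concurrent.atomic.AtomicLong;

    public class CountBenchmark {
        public static void main(String[] args) throws InterruptedException {
            int nThreads = Integer.parseInt(args[0]); // e.g. 1, 2, 3 ... 64
            Thread[] workers = new Thread[nThreads];
            long start = System.nanoTime();
            for (int i = 0; i < nThreads; i++) {
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        // private counter: no sharing, so no memory-bus traffic
                        AtomicLong counter = new AtomicLong();
                        while (counter.incrementAndGet() < 1000000000L) {
                            // pure CPU work
                        }
                    }
                });
                workers[i].start();
            }
            for (Thread worker : workers) {
                worker.join();
            }
            System.out.printf("%d threads took %.2f s%n",
                    nThreads, (System.nanoTime() - start) / 1e9);
        }
    }

Because each counter is thread-private, the only resource this workload consumes is CPU, which is why it scales linearly where the real function does not.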
I thought (desperation kicking in) that perhaps garbage collection was taking a really, really long time, so I tweaked the garbage collection parameters. While this improved my latency variation, it had no effect on throughput: I still have 64 threads running at the speed I expect 3 to run at.
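For reference, the tuning meant here is along these lines; the flag values below are illustrative, not the ones actually used (CMS trades some throughput for shorter pauses, which matches the improved latency variation):

    java -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         MyBenchmark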
I have downloaded the Intel tool VTune, but my skill with it is weak: it is a complex tool and I don't understand it yet. I have the instruction book on order (a fun Christmas present to myself), but that is a little too late to help with my current problem.
Question
What is limiting my 64 threads to the throughput of 3, and how can I diagnose and fix it?
Answers

In fact, multithreading can be slower due to the overhead of creating threads and context switching between them. For example, a multithreaded program that spawns 100 threads and forces them all to wait on a single mutex can perform worse than its single-threaded equivalent.
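A minimal sketch of that effect (names and counts are illustrative): 100 threads serialising on one lock spend most of their time contending rather than working.

    public class ContendedCounter {
        private static final Object LOCK = new Object();
        private static long counter = 0;

        public static void main(String[] args) throws InterruptedException {
            Thread[] threads = new Thread[100];
            long start = System.nanoTime();
            for (int i = 0; i < threads.length; i++) {
                threads[i] = new Thread(new Runnable() {
                    public void run() {
                        for (int j = 0; j < 1000000; j++) {
                            synchronized (LOCK) { // every increment serialises here
                                counter++;
                            }
                        }
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join();
            }
            System.out.printf("counter=%d in %.2f s%n",
                    counter, (System.nanoTime() - start) / 1e9);
        }
    }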
Java's garbage collector is very robust with respect to circular references, and I don't see why it wouldn't work with multiple threads running at the same time. So it is safe to assume that you don't need to write special code for garbage collection; Java will do it for you very effectively.
I have a piece of Java code (JDK 1.6.0_22, if relevant)
There have been quite a lot of performance improvements since then. I would try Java 6 update 37 or Java 7 update 10.
It does, however, use a lot of memory
This can mean that the way you access your data is important. Accessing data in main memory can be 20+ times slower than accessing your primary cache. This means you have to access data conservatively and make the most of each piece of new data you fetch.
After 3 threads I can put as many threads as I want to the task, and the performance gets no better. ... Instead I found all the threads running: just very slowly.
This suggests you are using some resource to its maximum. Given the amount of memory you are using, the most likely resource to be maxed out is the CPU-to-main-memory bridge. I suspect you have one bridge for all 64 threads! This means you should consider approaches that might use more CPU but improve the way you access memory (less randomly and more sequentially), and that reduce the volume of data you move (use more compact types where possible). For example, I have a "short with two decimal places" type instead of float, which can halve the memory used.
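A hypothetical sketch of such a compact type; the encoding below (hundredths held in a short) is an assumption about what "two decimal places" means, and the real saving comes from storing raw short[] arrays rather than wrapper objects:

    public final class ShortDecimal {
        // Packs a value like 12.34 into a 2-byte short (vs. float's 4 bytes).
        public static short pack(double value) {
            long scaled = Math.round(value * 100);
            if (scaled < Short.MIN_VALUE || scaled > Short.MAX_VALUE)
                throw new IllegalArgumentException("out of range: " + value);
            return (short) scaled;
        }

        public static double unpack(short hundredths) {
            return hundredths / 100.0;
        }

        public static void main(String[] args) {
            // Store the raw shorts in an array: half the footprint of float[]
            short[] prices = new short[] { pack(12.34), pack(-0.05) };
            System.out.println(unpack(prices[0]) + ", " + unpack(prices[1])); // 12.34, -0.05
        }
    }

The trade-off is range and precision (roughly -327.68 to 327.67 in this encoding) in exchange for moving half as many bytes across the memory bus.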
As you observed, when each thread updates its own private AtomicLong you get linear scalability. This doesn't use the CPU-to-main-memory bridge at all.
From @Marko
Peter, do you have an idea how these multicore architectures work with memory, anyway?
Not as much as I would like, as this issue is not visible from Java.
Does each core have an independent channel?
Each core has an independent channel to its primary cache. For the outer cache there might be a channel for each of 2-6 cache regions, but you will see a high number of collisions under heavy load.
For the bridge to main memory there is one very wide channel. This favours long sequential accesses but is very poor for random accesses. A single thread can max it out with random reads (random enough that they don't fit in the outer cache); see the benchmark sketch after this exchange.
Or at least independent as long as there are no collisions?
Once you exhaust the primary cache (L1 typically 32 KB) it's collisions all the way.
Because otherwise scaling is a serious issue.
As the OP demonstrates. Most applications either a) spend a significant portion of their time waiting for IO, or b) do a lot of computation on small batches of data. Doing a lot of computation across large amounts of data is the worst-case scenario.
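To make the sequential vs. random difference concrete, here is a rough benchmark sketch (the array size and the single untimed-warm-up-free runs are illustrative, so it only gives a ballpark; run with around -Xmx512m):

    import java.util.Random;

    public class AccessPattern {
        public static void main(String[] args) {
            int n = 1 << 25; // 32M ints = 128 MB, far larger than any cache
            int[] data = new int[n];
            int[] order = new int[n];
            for (int i = 0; i < n; i++) {
                data[i] = i;
                order[i] = i;
            }
            Random rnd = new Random(1);
            for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle of the visit order
                int j = rnd.nextInt(i + 1);
                int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
            }
            long sum = 0;
            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) sum += data[i];        // sequential pass
            long t1 = System.nanoTime();
            for (int i = 0; i < n; i++) sum += data[order[i]]; // random pass
            long t2 = System.nanoTime();
            System.out.printf("sequential %.0f ms, random %.0f ms (sum=%d)%n",
                    (t1 - t0) / 1e6, (t2 - t1) / 1e6, sum);
        }
    }

On arrays much larger than the outer cache, the random pass is typically several times slower than the sequential one, even though both do the same number of additions.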
The way I deal with this is to arrange my data structures in memory for sequential access. I use off-heap memory, which is a pain but gives you full control of the layout. (My source data is memory-mapped for persistence.) I stream the data in with sequential accesses and try to make the most of each pass over the data, i.e. I minimise repeated accesses. Even then, with 16 cores it is hard to assume all of them will be used efficiently, as I have 40 GB of source data I am working on at any one time and about 80 GB of derived data.
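A hedged sketch of that approach using a memory-mapped file; the file name, record layout, and 1 GB window are made up for illustration:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class StreamMapped {
        public static void main(String[] args) throws Exception {
            RandomAccessFile file = new RandomAccessFile("source.dat", "r");
            FileChannel ch = file.getChannel();
            // A single mapping is limited to 2 GB; map up to 1 GB here.
            // A real 40 GB data set needs several mappings in turn.
            long size = Math.min(ch.size(), 1L << 30);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long total = 0;
            while (buf.remaining() >= 8) {
                total += buf.getLong(); // one sequential pass, each value read once
            }
            System.out.println("total = " + total);
            ch.close();
            file.close();
        }
    }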
Note: high-end GPUs address this problem by having incredibly high memory bandwidth. A top-end processor can reach 250 GB/second, whereas a typical CPU manages about 4-6 GB/second. Even so, GPUs are better suited to vectorised processing, and their quoted peak performance is likely to involve little memory access, e.g. Mandelbrot sets.
http://www.nvidia.com/object/tesla-servers.html