Conventional wisdom tells us that high-volume enterprise Java applications should use thread pooling in preference to spawning new worker threads. The use of java.util.concurrent makes this straightforward.
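For example, a minimal sketch of the pooled approach (class name and pool size are my own, purely illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PooledExample {
    public static void main(String[] args) {
        // A fixed pool reuses a small set of pre-created worker threads
        // instead of spawning a new thread per task.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 100; i++) {
            final int requestId = i;
            pool.submit(new Runnable() {
                public void run() {
                    System.out.println("Handling request " + requestId
                            + " on " + Thread.currentThread().getName());
                }
            });
        }
        pool.shutdown();
    }
}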
There do exist situations, however, where thread pooling is not a good fit. The specific example I am currently wrestling with is the use of InheritableThreadLocal, which allows ThreadLocal variables to be "passed down" to any spawned threads. This mechanism breaks when using thread pools, since the worker threads are generally not spawned from the request thread, but are pre-existing.
Now there are ways around this (the thread locals can be explicitly passed in), but this isn't always appropriate or practical. The simplest solution is to spawn new worker threads on demand and let InheritableThreadLocal do its job.
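To make the trade-off concrete, here is a small sketch (class and variable names are mine) contrasting a freshly spawned thread, which inherits the parent's value, with a pooled worker, which only sees whatever was in scope when the pool thread was created:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class InheritanceSketch {
    // A child thread copies the parent's value at the moment the child is created.
    private static final InheritableThreadLocal<String> REQUEST_ID =
            new InheritableThreadLocal<String>();

    public static void main(String[] args) throws Exception {
        // Force the pool to create its single worker thread before any value is set.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.submit(new Runnable() { public void run() { } }).get();

        REQUEST_ID.set("req-42"); // set in the "request" (main) thread

        // A freshly spawned thread inherits the value.
        Thread spawned = new Thread(new Runnable() {
            public void run() {
                System.out.println("spawned thread sees: " + REQUEST_ID.get());
            }
        });
        spawned.start();
        spawned.join();

        // The pooled worker was created before set() was called.
        pool.submit(new Runnable() {
            public void run() {
                System.out.println("pooled thread sees: " + REQUEST_ID.get());
            }
        }).get();
        pool.shutdown();
    }
}

On my understanding of InheritableThreadLocal, the first line prints "req-42" and the second prints "null", which is exactly the breakage described above.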
This brings us back to the question: if I have a high-volume site, where user request threads are spawning off half a dozen worker threads each (i.e. not using a thread pool), is this going to give the JVM a problem? We're potentially talking about a couple of hundred new threads being created every second, each one lasting less than a second. Do modern JVMs optimize this well? I remember the days when object pooling was desirable in Java because object creation was expensive; this has since become unnecessary. I'm wondering if the same applies to thread pooling.
I'd benchmark it, if I knew what to measure, but my fear is that the problems may be more subtle than can be measured with a profiler.
Note: the wisdom of using thread locals is not the issue here, so please don't suggest that I not use them.
Creating a new thread only costs around 70 µs. This could be considered trivial in many, if not most, use cases. Relatively speaking it is still more expensive than the alternatives, and for some situations a thread pool, or not using threads at all, is the better solution.
Java thread creation is expensive because there is a fair bit of work involved: a large block of memory has to be allocated and initialized for the thread stack, and system calls need to be made to create and register the native thread with the host OS.
Multithreading also induces its own overhead, mainly caused by synchronization, spinning at user level, and NUMA management. The overhead is diverse in nature because it is a function of many system and workload properties. System-level solutions are feasible, but often imply difficult trade-offs.
Creating a thread is expensive, and each thread's stack requires memory. Also, if your process uses many threads, context switching can kill performance.
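As an aside on the stack cost: the per-thread stack is reserved when the thread is created, its default size can be changed JVM-wide with the -Xss option, and one Thread constructor accepts a stackSize argument that the JVM is free to treat as a hint. A hedged sketch (class and thread names are mine):

public class StackSizeSketch {
    public static void main(String[] args) {
        Runnable work = new Runnable() {
            public void run() {
                System.out.println("running on " + Thread.currentThread().getName());
            }
        };
        // Request a smaller stack for this particular thread; the JVM may ignore the hint.
        // The JVM-wide default can also be lowered with e.g. -Xss256k.
        Thread small = new Thread(null, work, "small-stack-worker", 64 * 1024);
        small.start();
    }
}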
Here is an example microbenchmark:
public class ThreadSpawningPerformanceTest {
    static long test(final int threadCount, final int workAmountPerThread) throws InterruptedException {
        Thread[] tt = new Thread[threadCount];
        final int[] aa = new int[tt.length];
        System.out.print("Creating " + tt.length + " Thread objects... ");
        long t0 = System.nanoTime(), t00 = t0;
        for (int i = 0; i < tt.length; i++) {
            final int j = i;
            tt[i] = new Thread() {
                public void run() {
                    int k = j;
                    for (int l = 0; l < workAmountPerThread; l++) {
                        k += k * k + l;
                    }
                    aa[j] = k;
                }
            };
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");

        System.out.print("Starting " + tt.length + " threads with " + workAmountPerThread + " steps of work per thread... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].start();
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");

        System.out.print("Joining " + tt.length + " threads... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].join();
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");

        long totalTime = System.nanoTime() - t00;
        int checkSum = 0; // display checksum so the JVM cannot optimize out the contents of run() and possibly even thread creation
        for (int a : aa) {
            checkSum += a;
        }
        System.out.println("Checksum: " + checkSum);
        System.out.println("Total time: " + totalTime * 1E-6 + " ms");
        System.out.println();
        return totalTime;
    }

    public static void main(String[] kr) throws InterruptedException {
        int workAmount = 100000000;
        int[] threadCount = new int[]{1, 2, 10, 100, 1000, 10000, 100000};
        int trialCount = 2;
        long[][] time = new long[threadCount.length][trialCount];
        for (int j = 0; j < trialCount; j++) {
            for (int i = 0; i < threadCount.length; i++) {
                time[i][j] = test(threadCount[i], workAmount / threadCount[i]);
            }
        }
        System.out.print("Number of threads ");
        for (long t : threadCount) {
            System.out.print("\t" + t);
        }
        System.out.println();
        for (int j = 0; j < trialCount; j++) {
            System.out.print((j + 1) + ". trial time (ms)");
            for (int i = 0; i < threadCount.length; i++) {
                System.out.print("\t" + Math.round(time[i][j] * 1E-6));
            }
            System.out.println();
        }
    }
}
The results on 64-bit Windows 7 with Sun's 32-bit Java 1.6.0_21 Client VM on an Intel Core 2 Duo E6400 @ 2.13 GHz are as follows:
Number of threads       1    2    10   100   1000   10000   100000
1. trial time (ms)    346  181   179   191    286    1229    11308
2. trial time (ms)    346  181   187   189    281    1224    10651
Conclusions: Two threads do the work almost twice as fast as one, as expected, since my computer has two cores. My computer can spawn nearly 10000 threads per second (the 100000-thread trial takes roughly 11 seconds), i.e. thread creation overhead is about 0.1 milliseconds per thread. Hence, on such a machine, a couple of hundred new threads per second pose a negligible overhead (as can also be seen by comparing the numbers in the columns for 2 and 100 threads).