I do know how to create my own ExecutionContext or to import the Play framework global one. But I must admit I am far from being an expert on how multiple contexts/execution services work under the hood.
So my question is: for better performance/behaviour of my service, which ExecutionContext should I use?
I tested two options:
import play.api.libs.concurrent.Execution.defaultContext
and
implicit val executionContext = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()))
Both resulted in comparable performance.
The action I use is implemented like this in Play Framework 2.1.x. SedisPool is my own object that adds Future wrapping around a normal sedis/jedis client pool.
def testaction(application: String, platform: String) = Action {
  Async(
    SedisPool.withAsyncClient[Result] { client =>
      client.get(StringBuilder.newBuilder.append(application).append('-').append(platform).toString) match {
        case Some(x) => Ok(x)
        case None => Results.NoContent
      }
    })
}
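SedisPool is my own code, but for context, a minimal sketch of what such a wrapper can look like follows. The trait, names and the in-memory stand-in client are my own illustration, not the real sedis/jedis API; the point is simply that each blocking call is run inside a Future on a dedicated ExecutionContext so the default pool is never blocked.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object SedisPoolSketch {
  // Stand-in for a blocking sedis/jedis client (illustration only).
  trait Client { def get(key: String): Option[String] }

  // Dedicated pool for blocking Redis calls, separate from Play's default context.
  private val redisContext: ExecutionContext =
    ExecutionContext.fromExecutorService(
      Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors()))

  // Fake client backed by a Map, so the sketch is self-contained.
  private val fakeClient: Client = new Client {
    private val store = Map("myapp-ios" -> "some-value")
    def get(key: String): Option[String] = store.get(key)
  }

  // Run the blocking body on the dedicated context and expose it as a Future.
  def withAsyncClient[T](body: Client => T): Future[T] =
    Future(body(fakeClient))(redisContext)
}
```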
Performance-wise this behaves as well as, or slightly slower than, the exact same function in Node.js and Go, but still slower than PyPy, and way faster than the same thing in Java (using blocking calls to Redis through jedis). We load tested with Gatling. We were running a "competition" of technologies for simple services on top of Redis, and the criterion was "with the same amount of effort from coders". I already tested this using fyrie (and apart from the fact that I do not like the API) it behaved almost the same as this sedis implementation.
But that's beside my question; I just want to learn more about this part of Play Framework/Scala.
Is there an advised behaviour, or could someone point me in a better direction? I am just starting with Scala and am far from an expert, but I can walk myself through code answers.
Thanks for any help.
After tampering with the number of threads in the pool, I found that Runtime.getRuntime().availableProcessors() * 20
gives around a 15% to 20% performance boost to my service (measured in requests per second and in average response time), which actually makes it slightly better than Node.js and Go (barely, though). So I now have more questions:
- I tested 15x and 25x, and 20x seems to be a sweet spot. Why? Any ideas?
- Would there be other settings that might be better? Other "sweet spots"?
- Is 20x the sweet spot, or is it dependent on other parameters of the machine/JVM I am running on?
I found more information in the Play framework docs: http://www.playframework.com/documentation/2.1.0/ThreadPools
For IO they advise something similar to what I've done, but give a way to do it through Akka dispatchers that are configurable through *.conf files (this should make my ops happy).
So now I am using
implicit val redis_lookup_context: ExecutionContext = Akka.system.dispatchers.lookup("simple-redis-lookup")
with the dispatcher configured by
akka {
  event-handlers = ["akka.event.slf4j.Slf4jEventHandler"]
  loglevel = WARNING
  actor {
    simple-redis-lookup = {
      fork-join-executor {
        parallelism-factor = 20.0
        #parallelism-min = 40
        #parallelism-max = 400
      }
    }
  }
}
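To see what the settings above actually do, it helps to know how Akka turns parallelism-factor into a pool size: it multiplies the factor by the number of cores, rounds up, and clamps the result between parallelism-min and parallelism-max. A minimal sketch of that formula (the function name is mine, mirroring the clamp behaviour described in the Akka docs):

```scala
// Pool size = clamp(parallelism-min, ceil(cores * parallelism-factor), parallelism-max)
def poolSize(cores: Int, factor: Double, min: Int, max: Int): Int =
  math.min(max, math.max(math.ceil(cores * factor).toInt, min))
```

So with parallelism-factor = 20.0 on an 8-core box, the dispatcher gets a pool of 160 threads unless the min/max bounds kick in.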
It gave me around a 5% boost (eyeballing it), and more stable performance once the JVM was "hot". And my sysops are happy to play with those settings without rebuilding the service.
My questions still stand, though. Why these numbers?
The way I think about optimization is to look first at single-threaded performance, then at multi-threading.
Single-threaded optimization
The performance of a single thread will typically be gated on a single component or section of your code; it might be the CPU itself, memory access, disk or network IO, or the latency of a remote service (here, Redis).
However, latencies in a single thread are not so worrisome if you can run multiple threads: while one thread is blocked, another can use the CPU (at the cost of swapping out the context and replacing most of the items in the CPU cache). So how many threads should you run?
Multi-threading
Let's assume that the thread spends about 50% of the time on the CPU and 50% waiting for IO. In that case, each CPU can be fully utilized by 2 threads, and you see a 2x throughput improvement. If the thread spends about 1% of the time using CPU, you should (all things being equal) be able to run 100 threads concurrently.
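The rule of thumb in the previous paragraph can be written down directly: if each thread spends a fraction f of its time on the CPU (and the rest blocked on IO), then roughly 1/f threads can keep one core busy. A one-line sketch (the function name is mine):

```scala
// If each thread uses the CPU for fraction `cpuFraction` of its time,
// roughly 1 / cpuFraction threads can fully utilize one core.
def threadsPerCore(cpuFraction: Double): Long =
  math.round(1.0 / cpuFraction)
```

This reproduces the examples above: 50% CPU time gives 2 threads per core, 1% gives 100.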
However, this is where a lot of weird effects can occur: as you increase the number of threads n, you will never quite get an n-times throughput improvement, and past a critical point, increasing n further will actually decrease performance. If this happens, you need to either rethink your algorithm; change the server, network or network services; or decrease parallelism.
Factors that affect how many threads you can run
From the above, you can see that there are a metric ton of factors involved. As a result, the sweet spot of threads per core is an accident of multiple causes, including CPU speed and cache size, memory bandwidth, the latency and throughput of the network and of the Redis server itself, and the JVM's threading and GC behaviour.
From experience, there is no magic formula to compute the best number of threads a priori. This problem is best tackled empirically, just as you have done. If you need to generalize, you will need to sample performance over different CPU architectures, memory and networks on the operating system of your choice.
Several easily observed metrics are useful here: CPU utilization per core, the rate of context switches, network throughput, and GC activity.
If you need to optimize, get the best profiling tools you can. You would need a specific tool for monitoring the operating system (e.g. DTrace for Solaris) and one for the JVM (I personally love JProfiler). These tools will allow you to zoom in on precisely the areas I describe above.
Conclusions
It happens that your particular code, on your particular Scala library version, JVM version, OS, server and Redis server, runs so that each thread waits for IO about 95% of the time. (If you ran single-threaded, you'd find the CPU load to be about 5%.)
This allows about 20 threads (1 / 0.05) to share each CPU optimally in this configuration.
This is the sweet spot because fewer threads would leave the CPU idle while waiting on Redis, and more threads would add scheduling and cache-churn overhead without any extra useful work to do.
Have you tried changing your thread pool type or its other settings, for example a thread-pool-executor instead of the fork-join-executor?
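As one possibility (my assumption, following the Akka dispatcher configuration format rather than anything measured here), the same dispatcher could be switched to a thread-pool-executor, which sizes its pool with its own min/factor/max settings:

```
simple-redis-lookup = {
  executor = "thread-pool-executor"
  thread-pool-executor {
    core-pool-size-factor = 20.0
    core-pool-size-max = 400
  }
}
```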