Why does this code run faster with a lock?

Some background: I created a contrived example to demonstrate the use of VisualVM to my team. In particular, one method had an unnecessary synchronized keyword, and we saw threads in the thread pool blocking where they didn't need to. But removing that keyword had the surprising effect described below. The code below is the simplest case I could reduce the original example to that still reproduces the issue; using a ReentrantLock instead of synchronized produces the same effect.

Please consider the code below (full runnable code example at https://gist.github.com/revbingo/4c035aa29d3c7b50ed8b - you need to add Commons Math 3.4.1 to the classpath). It creates 100 tasks and submits them to a thread pool of 5 threads. Each task creates two 500x500 matrices of random values and then multiplies them.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.random.JDKRandomGenerator;
import org.apache.commons.math3.random.StableRandomGenerator;
import org.apache.commons.math3.random.UncorrelatedRandomVectorGenerator;

public class Main {

    private static ExecutorService exec = Executors.newFixedThreadPool(5);

    private final static int MATRIX_SIZE = 500;

    // single generator shared by all tasks
    private static UncorrelatedRandomVectorGenerator generator =
            new UncorrelatedRandomVectorGenerator(MATRIX_SIZE,
                    new StableRandomGenerator(new JDKRandomGenerator(), 0.1d, 1.0d));

    private static ReentrantLock lock = new ReentrantLock();

    public static void main(String[] args) throws Exception {

        for (int i = 0; i < 100; i++) {

            exec.execute(new Runnable() {
                @Override
                public void run() {
                    // fill two 500x500 matrices, one random row at a time,
                    // from the shared generator
                    double[][] matrixArrayA = new double[MATRIX_SIZE][MATRIX_SIZE];
                    double[][] matrixArrayB = new double[MATRIX_SIZE][MATRIX_SIZE];
                    for (int j = 0; j < MATRIX_SIZE; j++) {
                        matrixArrayA[j] = generator.nextVector();
                        matrixArrayB[j] = generator.nextVector();
                    }

                    RealMatrix matrixA = MatrixUtils.createRealMatrix(matrixArrayA);
                    RealMatrix matrixB = MatrixUtils.createRealMatrix(matrixArrayB);

                    // the lock in question: serializes only the multiplication
                    lock.lock();
                    matrixA.multiply(matrixB);
                    lock.unlock();
                }
            });
        }
    }
}

The ReentrantLock is actually unnecessary: there is no shared state between the threads that needs synchronization. With the lock in place, we observe, as expected, the threads in the thread pool blocking. With the lock removed, there is no more blocking, and all threads are able to run fully in parallel.

The unexpected result is that removing the lock consistently makes the code take 15-25% longer to complete on my machine (a quad-core i7). Profiling shows no indication of any blocking or waiting in the threads, and total CPU usage is only around 50%, spread relatively evenly over the cores.

The second unexpected thing is that the effect also depends on the type of generator used. If I use a GaussianRandomGenerator or UniformRandomGenerator instead of the StableRandomGenerator, the expected result is observed: removing the lock makes the code run faster (by around 10%).
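For clarity, that swap is only a change to how the shared generator is constructed, along these lines (constructors as in Commons Math 3.x):

// exhibits the odd behaviour
new UncorrelatedRandomVectorGenerator(MATRIX_SIZE,
        new StableRandomGenerator(new JDKRandomGenerator(), 0.1d, 1.0d));

// behaves as expected - removing the lock speeds things up
new UncorrelatedRandomVectorGenerator(MATRIX_SIZE,
        new GaussianRandomGenerator(new JDKRandomGenerator()));
new UncorrelatedRandomVectorGenerator(MATRIX_SIZE,
        new UniformRandomGenerator(new JDKRandomGenerator()));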

If the threads are not blocking, the CPU is at a reasonable level, and there is no IO involved, how can this be explained? The only clue I really have is that the StableRandomGenerator invokes a lot of trigonometric functions, so it is clearly much more CPU intensive than the Gaussian or Uniform generators - but why then am I not seeing the CPU being maxed out?

EDIT: Another important point (thanks Joop) - making the generator local to the Runnable (i.e. one per thread) restores the expected behaviour, where adding the lock slows the code by around 50%. So the key conditions for the odd behaviour are a) using a StableRandomGenerator, and b) having that generator shared between the threads. But to the best of my knowledge, that generator is thread-safe.
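For reference, the per-Runnable variant just moves the generator construction inside run(), roughly like this (only the start of run() shown; localGenerator is an illustrative name):

public void run() {
    // one generator per task, nothing shared between threads
    UncorrelatedRandomVectorGenerator localGenerator =
            new UncorrelatedRandomVectorGenerator(MATRIX_SIZE,
                    new StableRandomGenerator(new JDKRandomGenerator(), 0.1d, 1.0d));

    double[][] matrixArrayA = new double[MATRIX_SIZE][MATRIX_SIZE];
    double[][] matrixArrayB = new double[MATRIX_SIZE][MATRIX_SIZE];
    for (int j = 0; j < MATRIX_SIZE; j++) {
        matrixArrayA[j] = localGenerator.nextVector();
        matrixArrayB[j] = localGenerator.nextVector();
    }
    // ... create the RealMatrix objects and multiply as before
}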

EDIT2: Whilst this question is superficially very similar to the linked duplicate question, and the answer there is plausible and almost certainly a factor, I'm not yet convinced it's quite as simple as that. Things that make me question it:

1) The problem only appears when synchronizing on the multiply() operation, which makes no calls to Random. My immediate thought was that this synchronization ends up staggering the threads to some extent, and therefore "accidentally" improves the performance of Random#next(). However, synchronizing on the calls to generator.nextVector() (which should in theory have the same effect, in the "proper" place) does not reproduce the issue - synchronizing there slows the code as you might expect (sketched after the summary below).

2) The problem is only observed with the StableRandomGenerator, even though the other implementations of NormalizedRandomGenerator also use the JDKRandomGenerator (which, as pointed out, is just a wrapper for java.util.Random). In fact, if I replace the RandomVectorGenerator with direct calls to Random#nextDouble to fill the matrices, the behaviour again reverts to the expected result - synchronizing any part of the code causes total throughput to drop.

In summary, the issue can only be observed by

a) using StableRandomGenerator - no other implementation of NormalizedRandomGenerator, nor the JDKRandomGenerator or java.util.Random used directly, displays the same behaviour.

b) synchronizing the call to RealMatrix#multiply. The same behaviour is not observed when synchronizing the calls to the random generator.
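For point 1 above, the variant that does not reproduce the issue looks roughly like this - the lock is held around the shared generator calls instead of the multiply (body of run() only):

double[][] matrixArrayA = new double[MATRIX_SIZE][MATRIX_SIZE];
double[][] matrixArrayB = new double[MATRIX_SIZE][MATRIX_SIZE];

// serialize access to the shared generator instead of the multiplication
lock.lock();
try {
    for (int j = 0; j < MATRIX_SIZE; j++) {
        matrixArrayA[j] = generator.nextVector();
        matrixArrayB[j] = generator.nextVector();
    }
} finally {
    lock.unlock();
}

RealMatrix matrixA = MatrixUtils.createRealMatrix(matrixArrayA);
RealMatrix matrixB = MatrixUtils.createRealMatrix(matrixArrayB);
matrixA.multiply(matrixB);   // no lock here

With this version, adding the lock slows the code down, as you would expect.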

Asked Feb 24 '15 by RevBingo


2 Answers

The same problem as in the linked duplicate question.

You're actually measuring the contention inside a PRNG with a shared state.

JDKRandomGenerator is based on java.util.Random, which has a single seed shared among all your worker threads. The threads compete to update that seed in a compare-and-set loop.
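For illustration, the relevant part of java.util.Random looks roughly like this (simplified from OpenJDK); each nextDouble() makes two such calls, all spinning on the same AtomicLong:

// simplified sketch of java.util.Random#next (OpenJDK)
protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;          // one seed shared by every caller
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;
    } while (!seed.compareAndSet(oldseed, nextseed));  // retried under contention
    return (int) (nextseed >>> (48 - bits));
}

With five threads each filling two 500x500 matrices per task, the failed compare-and-set retries add up.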

Why does the lock improve performance, then? It helps to reduce contention inside java.util.Random by serializing the work: while one thread performs its matrix multiplication, the others are filling their matrices with random numbers. Without the lock, all the threads do the same work at the same time and so contend on the shared seed.

Answered by apangin


There is a lot to remember when using random number generators. Long story short, your quirks were caused by the generators having to collect enough entropy before they can give you a random number. By sharing the generator, each call requires the entropy to 'fill back up', so that was your blocking point. Some generators collect entropy differently than others, so some are more affected, or chain from previous output rather than building up from scratch. When you create the generator within each instance, each one builds entropy on its own, so it's faster.

Let me point you to SecureRandom, in particular the class JavaDoc, where it says, "Note: Depending on the implementation, the generateSeed and nextBytes methods may block as entropy is being gathered, for example, if they need to read from /dev/random on various unix-like operating systems." This is what you were seeing and why things were slow. Using a single generator, it kept blocking. Yes, it's thread-safe, but it blocks while gathering entropy (note that you were seeing contention within your threads as they waited for the blocking methods to return from generating random numbers, building entropy, etc.). When you put in your own locks, you were giving it time to collect entropy and do its thing in a 'polite' manner. It may be thread-safe, but that doesn't mean it's nice or efficient when bombarded ;-)

Also, for anything using java.util.Random, note this from the Random JavaDoc:

Instances of java.util.Random are threadsafe. However, the concurrent use of the same java.util.Random instance across threads may encounter contention and consequent poor performance. Consider instead using ThreadLocalRandom in multithreaded designs.
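If reproducibility across threads doesn't matter, a per-thread source along these lines avoids the shared seed entirely (a rough sketch, not the asker's code; perThreadGenerator is just an illustrative name):

// fill a row from ThreadLocalRandom - each thread has its own generator state,
// so there is no shared seed to contend on
double[] row = new double[MATRIX_SIZE];
for (int k = 0; k < MATRIX_SIZE; k++) {
    row[k] = ThreadLocalRandom.current().nextDouble();
}

// or keep the Commons Math generator but give each thread its own instance
// (ThreadLocal.withInitial requires Java 8+)
private static final ThreadLocal<UncorrelatedRandomVectorGenerator> perThreadGenerator =
        ThreadLocal.withInitial(() -> new UncorrelatedRandomVectorGenerator(MATRIX_SIZE,
                new StableRandomGenerator(new JDKRandomGenerator(), 0.1d, 1.0d)));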

Answered by DanO