I've had this question for quite a while now. I've read lots of resources trying to understand what is going on, but I still haven't managed to get a good understanding of why things are the way they are.
Simply put, I'm trying to test how a CAS would perform vs. `synchronized` in contended and uncontended environments. I've put up this JMH test:
```java
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class SandBox {

    Object lock = new Object();

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(SandBox.class.getSimpleName())
                .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                .shouldFailOnError(true)
                .build();
        new Runner(opt).run();
    }

    @State(Scope.Thread)
    public static class Holder {

        private long number;
        private AtomicLong atomicLong;

        @Setup
        public void setUp() {
            number = ThreadLocalRandom.current().nextLong();
            atomicLong = new AtomicLong(number);
        }
    }

    @Fork(1)
    @Benchmark
    public long sync(Holder holder) {
        long n = holder.number;
        synchronized (lock) {
            n = n * 123;
        }
        return n;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong cas(Holder holder) {
        AtomicLong al = holder.atomicLong;
        al.updateAndGet(x -> x * 123);
        return al;
    }

    private Object anotherLock = new Object();

    private long anotherNumber = ThreadLocalRandom.current().nextLong();
    private AtomicLong anotherAl = new AtomicLong(anotherNumber);

    @Fork(1)
    @Benchmark
    public long syncShared() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }
        return anotherNumber;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong casShared() {
        anotherAl.updateAndGet(x -> x * 123);
        return anotherAl;
    }

    @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
    @Benchmark
    public long syncSharedNonBiased() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }
        return anotherNumber;
    }
}
```
And the results:
```
Benchmark                                            Mode  Cnt     Score      Error  Units
spinLockVsSynchronized.SandBox.cas                   avgt    5   212.922 ±   18.011  ns/op
spinLockVsSynchronized.SandBox.casShared             avgt    5  4106.764 ± 1233.108  ns/op
spinLockVsSynchronized.SandBox.sync                  avgt    5  2869.664 ±  231.482  ns/op
spinLockVsSynchronized.SandBox.syncShared            avgt    5  2414.177 ±   85.022  ns/op
spinLockVsSynchronized.SandBox.syncSharedNonBiased   avgt    5  2696.102 ±  279.734  ns/op
```
In the non-shared case CAS is by far faster, which I would expect. But in the shared case, things are the other way around, and this I can't understand. I don't think this is related to biased locking, as that would kick in only after a thread has held the lock for 5 seconds (AFAIK), which does not happen here; the `syncSharedNonBiased` test is just proof of that.

I honestly hope it's just my tests that are wrong, and that someone with JMH expertise will come along and point out the incorrect set-up here.
First of all, making the methods `synchronized` is not necessary, since they do not modify anything outside of the method itself. In fact, it would prevent both threads from running at the same time: because the methods are static, no two threads could be in either of them simultaneously.
Main difference, locking: the `synchronized` keyword implements a lock-based threading model, whereas `volatile` and atomic variables are non-lock based, which means threads do not require a lock to access these variables.
Synchronizing multiple cores on a single chip or in a system requires atomic operations and hardware that enforces these operations. The motivation for atomic operations comes from the need to synchronize two or more entities when sharing a resource.
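To make the hardware-level idea concrete, here is a minimal sketch (the class and method names are mine, purely illustrative) of the read-compute-CAS retry pattern that such atomic instructions enable, using `AtomicLong.compareAndSet`:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the CAS retry pattern: read the current value,
// compute a new one, and commit it only if no other thread raced us.
public class CasDemo {
    private final AtomicLong value;

    public CasDemo(long initial) {
        value = new AtomicLong(initial);
    }

    // Multiply the current value by a factor using an explicit CAS loop.
    public long multiply(long factor) {
        long prev, next;
        do {
            prev = value.get();     // read the current value
            next = prev * factor;   // compute the proposed new value
        } while (!value.compareAndSet(prev, next)); // retry on a lost race
        return next;
    }
}
```

This is essentially what `AtomicLong.updateAndGet` does internally; the loop only repeats when another thread modified the value between the read and the CAS.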
The main misconception is the assumption that you are comparing "CAS vs. `synchronized`". Given how sophisticatedly JVMs implement `synchronized`, you are comparing the performance of a CAS-based algorithm using `AtomicLong` with the performance of the CAS-based algorithm used to implement `synchronized`.
Similar to `Lock`, the internal information for an object monitor basically consists of an `int` status telling whether it is owned and how often it is nested, a reference to the current owner thread, and a queue of threads waiting to acquire it. The expensive aspect is the waiting queue. Putting a thread into the queue, removing it from thread scheduling, and eventually waking it up when the current owner releases the monitor are operations that can take a significant time.

However, in the uncontended case, the waiting queue is, of course, not involved. Acquiring the monitor consists of a single CAS to change the status from "unowned" (usually zero) to "owned, acquired once" (guess the typical value). If successful, the thread can proceed with the critical action, followed by a release, which implies just writing the "unowned" state with the necessary memory visibility and waking up another blocked thread, if there is one.
Since the wait queue is the significantly more expensive part, implementations usually try to avoid it even in the contended case by performing some amount of spinning, making several repeated CAS attempts before falling back to enqueuing the thread. If the critical action of the owner is as simple as a single multiplication, chances are high that the monitor will be released while a contending thread is still in the spinning phase. Note that `synchronized` is "unfair", allowing a spinning thread to proceed immediately, even if there are already enqueued threads that have been waiting far longer.
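The spin-then-enqueue strategy can be sketched roughly like this (the spin count, queue type, and names are arbitrary choices for illustration; real implementations are considerably more careful):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Rough sketch of "spin first, enqueue later". Note that it is unfair in
// the same way as synchronized: a freshly spinning thread can grab the
// lock ahead of threads that are already parked in the queue.
public class SpinThenParkLock {
    private final AtomicInteger status = new AtomicInteger(0); // 0 = unowned
    private final ConcurrentLinkedQueue<Thread> waiters = new ConcurrentLinkedQueue<>();

    public void lock() {
        for (int i = 0; i < 64; i++) {                  // spinning phase
            if (status.compareAndSet(0, 1)) return;     // cheap fast path
        }
        waiters.add(Thread.currentThread());            // expensive slow path
        while (!status.compareAndSet(0, 1)) {
            LockSupport.park(this);                     // remove from scheduling
        }
        waiters.remove(Thread.currentThread());
    }

    public void unlock() {
        status.set(0);
        Thread next = waiters.peek();
        if (next != null) LockSupport.unpark(next);     // wake a blocked thread
    }
}
```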
If you compare the fundamental operations performed by `synchronized(lock){ n = n * 123; }` when no queuing is involved and by `al.updateAndGet(x -> x * 123);`, you'll notice that they are roughly on par. The main difference is that the `AtomicLong` approach will repeat the multiplication on contention, while the `synchronized` approach carries the risk of being put into the queue if no progress is made during spinning.
But `synchronized` allows lock coarsening for code repeatedly synchronizing on the same object, which might be relevant for a benchmark loop calling the `syncShared` method. Unless there's also a way to fuse multiple CAS updates of an `AtomicLong`, this can give `synchronized` a dramatic advantage. (See also this article covering several aspects discussed above.)
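Lock coarsening can be pictured as the JIT rewriting the first method below into something like the second. The transformation happens internally in the compiler; these two methods merely make it explicit, and they compute the same result:

```java
// Hedged sketch of what lock coarsening amounts to; the JIT decides on
// its own when this transformation is legal and profitable.
public class CoarseningSketch {
    static final Object lock = new Object();

    // What a benchmark loop looks like at the source level:
    // one lock acquisition and release per iteration.
    public static long fineGrained(long n) {
        for (int i = 0; i < 1000; i++) {
            synchronized (lock) {
                n *= 123;
            }
        }
        return n;
    }

    // What the JIT may effectively turn it into: a single acquisition
    // amortized over all iterations.
    public static long coarsened(long n) {
        synchronized (lock) {
            for (int i = 0; i < 1000; i++) {
                n *= 123;
            }
        }
        return n;
    }
}
```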
Note that due to the "unfair" nature of `synchronized`, creating far more threads than CPU cores doesn't have to be a problem. In the best case, "number of threads minus number of cores" threads end up on the queue, never waking up, while the remaining threads succeed in the spinning phase, one thread on each core. But likewise, threads not running on a CPU core can't reduce the performance of the `AtomicLong` update, as they can neither invalidate the current value for other threads nor make a failed CAS attempt.
In either case, when CASing on the member variable of an unshared object or when performing `synchronized` on an unshared object, the JVM may detect the local nature of the operation and elide most of the associated costs. But this may depend on several subtle environmental aspects.
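Here is a sketch of the kind of code where escape analysis can prove the monitor thread-local and elide the lock entirely; whether the elision actually happens depends on the JIT, inlining decisions, and the environmental aspects mentioned above:

```java
// Sketch of lock elision: the lock object never escapes the method, so
// no other thread can ever contend on it, and the JVM is free to drop
// the synchronization while preserving the method's semantics.
public class ElisionSketch {
    public static long elidable() {
        Object local = new Object(); // never escapes this method
        long n = 1;
        synchronized (local) {       // candidate for lock elision
            n *= 123;
        }
        return n;
    }
}
```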
The bottom line is that there is no easy decision between atomic updates and `synchronized` blocks. Things get far more interesting with more expensive operations, which raise the likelihood of threads getting enqueued in the contended case of `synchronized`, and which may make it acceptable that the operation has to be repeated in the contended case of an atomic update.
You should read, re-read, and accept @Holger's excellent answer, as the insights it provides are far more valuable than a single set of benchmark numbers from one developer's workstation.
I tweaked your benchmarks to make them a bit more apples-to-apples, but if you read @Holger's answer, you'll see why this isn't a terribly useful test. I'm going to include my changes and my results simply to show how results can vary from one machine (or one JRE version) to another.
First, my version of benchmarks:
```java
@State(Scope.Benchmark)
public class SandBox {

    public static void main(String[] args) throws RunnerException {
        new Runner(
            new OptionsBuilder().include(SandBox.class.getSimpleName())
                .shouldFailOnError(true)
                .mode(Mode.AverageTime)
                .timeUnit(TimeUnit.NANOSECONDS)
                .warmupIterations(5)
                .warmupTime(TimeValue.seconds(5))
                .measurementIterations(5)
                .measurementTime(TimeValue.seconds(5))
                .threads(-1)
                .build()
        ).run();
    }

    private long number = 0xCAFEBABECAFED00DL;
    private final Object lock = new Object();
    private final AtomicLong atomicNumber = new AtomicLong(number);

    @Setup(Level.Iteration)
    public void setUp() {
        number = 0xCAFEBABECAFED00DL;
        atomicNumber.set(number);
    }

    @Fork(1)
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long casShared() {
        return atomicNumber.updateAndGet(x -> x * 123L);
    }

    @Fork(1)
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long syncShared() {
        synchronized (lock) {
            return number *= 123L;
        }
    }

    @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long syncSharedNonBiased() {
        synchronized (lock) {
            return number *= 123L;
        }
    }
}
```
And then my first batch of results:
```
# VM version: JDK 1.8.0_60, VM 25.60-b23

Benchmark                    Mode  Cnt     Score     Error  Units
SandBox.casShared            avgt    5   976.215 ± 167.865  ns/op
SandBox.syncShared           avgt    5  1820.554 ±  91.883  ns/op
SandBox.syncSharedNonBiased  avgt    5  1996.305 ± 124.681  ns/op
```
Recall that you saw `synchronized` coming out ahead under high contention. On my workstation, the atomic version fared better. If you use my version of your benchmarks, what results do you see? It won't surprise me in the least if they're substantially different.
Here's another set run under a months-old Java 9 EA release:
```
# VM version: JDK 9-ea, VM 9-ea+170

Benchmark                    Mode  Cnt     Score     Error  Units
SandBox.casShared            avgt    5   979.615 ± 135.495  ns/op
SandBox.syncShared           avgt    5  1426.042 ±  52.971  ns/op
SandBox.syncSharedNonBiased  avgt    5  1649.868 ±  48.410  ns/op
```
The difference is less dramatic here. It's not terribly unusual to see a difference across major JRE versions, but who's to say you won't see them across minor releases too?
At the end of the day, the results are close. Very close. The performance of `synchronized` has come a long way since the early Java versions. If you are not writing HFT algorithms or something else that's incredibly latency sensitive, you should prefer the solution that's most easily proven correct. It is generally easier to reason about `synchronized` than about lock-free algorithms and data structures. If you cannot demonstrate a measurable difference in your application, then `synchronized` is what you should use.