I've been looking at Martin Thompson's article, which is an explanation of false sharing.
http://mechanical-sympathy.blogspot.co.uk/2011/07/false-sharing.html
public final class FalseSharing
    implements Runnable
{
    public final static int NUM_THREADS = 4; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    private static VolatileLong[] longs = new VolatileLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new VolatileLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }
        for (Thread t : threads)
        {
            t.start();
        }
        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong
    {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // comment out
    }
}
The example demonstrates the slowdown experienced when multiple threads invalidate each other's cache lines, even though each thread is exclusively updating a single variable.
Figure 1 above illustrates the issue of false sharing. A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y. Unfortunately these two hot variables reside in the same cache line. Each thread will race for ownership of the cache line so they can update it. If core 1 gets ownership then the cache sub-system will need to invalidate the corresponding cache line for core 2. When core 2 gets ownership and performs its update, then core 1 will be told to invalidate its copy of the cache line. This will ping-pong back and forth via the L3 cache, greatly impacting performance. The issue would be further exacerbated if competing cores are on different sockets and additionally have to cross the socket interconnect.
My question is the following. If all the variables being updated are volatile, why does this padding cause a performance increase? My understanding is that a volatile variable always writes and reads through to main memory. Therefore I'd assume that every write and read to any variable in this example will result in a flush of the current core's cache line.
So, according to my understanding, if thread one invalidates thread two's cache line, this will not become apparent to thread two until it goes to read a value from its own cache line. The value it's reading is volatile, so this effectively renders the cache dirty anyway, resulting in a read from main memory.
Where have I gone wrong in my understanding?
Thanks
False sharing occurs when threads on different processors modify variables that reside on the same cache line.
In general, false sharing can be reduced using the following techniques (a sketch of the first one follows below):
- Make use of private or threadprivate data as much as possible.
- Use the compiler's optimization features to eliminate memory loads and stores.
- Pad data structures so that each thread's data resides on a different cache line.
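As an illustration of the first technique (a sketch of my own, not from the article; the class and field names are made up), each thread can accumulate into a thread-private local variable and write to shared memory only once when it finishes, so nothing is contended inside the hot loop:

// Sketch (hypothetical names): keep the hot data thread-private so the
// inner loop never touches a shared cache line.
public class PrivateAccumulation implements Runnable
{
    private static final int NUM_THREADS = 4;
    private static final long ITERATIONS = 500L * 1000L * 1000L;
    private static final long[] results = new long[NUM_THREADS];

    private final int index;

    public PrivateAccumulation(final int index)
    {
        this.index = index;
    }

    public void run()
    {
        long local = 0L;                 // thread-private, typically kept in a register
        for (long i = 0; i < ITERATIONS; i++)
        {
            local += i;                  // no shared memory written inside the loop
        }
        results[index] = local;          // one write to shared memory at the end
    }

    public static void main(final String[] args) throws InterruptedException
    {
        final Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new PrivateAccumulation(i));
            threads[i].start();
        }
        for (final Thread t : threads)
        {
            t.join();
        }
        System.out.println("results[0] = " + results[0]);
    }
}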
"false sharing" is something that happens in (some) cache systems when two threads (or rather two cores) writes to two different variables that belongs to the same cache line.
Unlike synchronized methods or blocks, volatile does not make other threads wait while one thread is working on a critical section. Therefore, the volatile keyword does not provide thread safety when non-atomic or composite operations are performed on shared variables.
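To illustrate that point (a minimal sketch of my own, not from the article), a volatile counter incremented from two threads will usually lose updates, because ++ is a read-modify-write and volatile only guarantees visibility, not atomicity:

// Sketch (hypothetical class): volatile guarantees visibility, not atomicity,
// so a concurrent ++ on a volatile field can lose updates.
public class VolatileNotAtomic
{
    private static volatile long counter = 0L;

    public static void main(final String[] args) throws InterruptedException
    {
        final Runnable increment = () ->
        {
            for (int i = 0; i < 1_000_000; i++)
            {
                counter++;               // read-modify-write; two threads can interleave here
            }
        };
        final Thread t1 = new Thread(increment);
        final Thread t2 = new Thread(increment);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Usually prints a value well below 2,000,000.
        System.out.println("counter = " + counter);
    }
}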
If all the variables being updated are volatile, why does this padding cause a performance increase?
So there are two things going on here:

1. There are multiple VolatileLong objects, with each thread working on its own VolatileLong. (See private final int arrayIndex.)
2. Each VolatileLong object has a single volatile field.

The volatile access means that the threads have to both invalidate the cache "line" that holds their volatile long value and lock that cache line to update it. As the article states, a cache line is typically ~64 bytes or so.
The article is saying that by adding padding to the VolatileLong object, it moves the object that each of the threads is locking into a different cache line. So even though the different threads are still crossing memory barriers as they assign their volatile long value, they are in different cache lines and so won't consume excessive L2 cache bandwidth.
In summary, the performance increase happens because even though the threads are still locking their cache line to update the volatile
field, these locks are now on different memory blocks and so they are not clashing with the other threads' locks and causing cache invalidations.
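As a rough back-of-the-envelope check (my own numbers, assuming a 64-bit HotSpot JVM with a ~16-byte object header and 64-byte cache lines; exact layout and sizes vary by JVM, flags and hardware):

public final static class VolatileLong
{
    public volatile long value = 0L;     //  8 bytes
    public long p1, p2, p3, p4, p5, p6;  // 48 bytes of padding
    // ~16-byte object header + 8 + 48 ≈ 72 bytes, which is larger than one
    // 64-byte cache line, so two VolatileLong instances cannot land on the
    // same line. Without p1..p6 the object is only ~24 bytes, so two or three
    // instances (and their hot 'value' fields) can share a single line and
    // falsely share.
}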