Even faster inexpensive thread-safe counter?

Q: How to achieve thread safety in multithreaded applications?

In most cases, errors in multithreaded applications are the result of incorrectly sharing state between several threads. Therefore, the first approach that we'll look at is to achieve thread-safety using stateless implementations.

Q: What is thread safety in C++?

In multithreaded environments, we need to write implementations in a thread-safe way. This means that different threads can access the same resources without exposing erroneous behavior or producing unpredictable results. This programming methodology is known as “thread-safety”.

Q: Is factorial () method thread safe?

Hence, it's considered to be thread-safe and can be safely called by multiple threads at the same time. All threads can safely call the factorial () method and will get the expected result without interfering with each other and without altering the output that the method generates for other threads.

Q: How to achieve thread-safety in Java?

Using an atomic variable is another way to achieve thread-safety in java. When variables are shared by multiple threads, the atomic variable ensures that threads don’t crash into each other.

Tags:

c#

multithreading

cpu-cache

interlocked

mesi

I've read this topic: C# Thread safe fast(est) counter and have implemented this feature in my parallel code. As far as I can see it all works fine, however it has measurably increased the processing time, as in 10% or so.

It's been bugging me a bit, and I think the problem lies in the fact that I'm doing a huge number of relatively cheap (<1 quantum) tasks on small data fragments which are well partioned and probably make good use of cache locality, thus running optimally. My best guess, based on what little I know about MESI, is that x86 LOCK prefix in Interlocked.Increment pushes cacheline into Exclusive mode and forces a cache miss on other cores and forces cache reload on every single parallel pass just for the sake of incrementing this counter. With 100ns-ish delay for cache miss and my workload it seems to add up. (Then again, I could be wrong)

Now, I don't see a way around it, but maybe I am missing something obvious. I was even thinking about using n counters (corresponding to degree of parallelization) and then incrementing each on specific core, however it seems unfeasible (detecting which core I am on will probably be more expensive, not to mention an elaborate if/then/else structure and messing up with the execution pipeline). Any ideas on how to break this beast? :)

882

asked Jul 17 '15 17:07

mmix

2 Answers

Operations from multiple cores on the same cache line contend in hardware. This is true for locked and for regular memory access. This is a real problem. Contending accesses do not scale at all when more cores are added. Scaling typically is hard negative.

You need to use multiple cache lines with each core using its own most of the time.

You could use a ThreadLocal<Holder> and class Holder { public int I; } for that. ThreadLocal supports enumerating all instances that have been created so that you can sum them. You also could use a struct that is padded to the cache line size. That's safer.

Note, that it is not important to use one counter per core. Per-thread is good enough because time quantums are incredibly long compared to increment operations. A few bad accesses are not a performance problem.

A faster option would be to use a Holder[]. Each thread draws a random array index once and then accesses that holder object. Array indexing is faster than thread local access. If the number of holder instances you use is much bigger (10x) than the number of threads there will be little contention. Most writes will go the same already cached line.

Instead of a random index you can use a List<Holder> and add items as more threads join the processing.

answered Sep 28 '22 21:09

usr

I thought I would offer some clarification about cache coherence and what the LOCK prefix does in Intel architectures. Since it's too long for a comment and also answers some points your raised I think it's appropriate to post as an answer.

In the MESI cache coherence protocol, any write to a cache line will cause the state to change to exclusive, regardless of whether you are using the LOCK prefix or not. So if two processors are both accessing the same cache line repeatedly, and at least 1 of the processors is doing writes, then the processors will experience cache line misses when accessing the line that they are sharing. Whereas if they were both only reading from the line then they would have cache line hits because they could both keep the line in their private L1 cache in the shared state.

What the LOCK prefix does is restrict the amount of speculative work that a processor can do while waiting for the the locked instruction to finish executing. Section 8.1.2 of the Intel 64 and IA-32 Architectures Software Developer’s Manual says:

Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.

Under normal circumstances the processor is able to speculatively execute instructions while waiting for a cache miss to be resolved. But the LOCK prefix prevents this and essentially stalls the pipeline until the locked instruction finished execution.

answered Sep 28 '22 22:09

Gabriel Southern

Related questions
                            
                                How to load dll's during debug in VS2013
                            
                                Memory Mapped File gets deleted from memory
                            
                                Can't find embedded resource using GetManifestResourceStream
                            
                                How to I create a unit test for a Visual Studio 2015 RC / .NET Core project?
                            
                                Bundling files in different directories?
                            
                                Dependency Property in F# Default Value does not match
                            
                                How do I change the format of a decimal number in EpPlus?
                            
                                MVC5 and EF6.1.3 Scaffolding problems
                            
                                Can you use equals assignment when removing delegate members in a disposing method?
                            
                                How to get the exception from HttpResponseMessage in web API?
                            
                                Unintended consequences when changing next line of execution in Visual Studio [duplicate]
                            
                                Owin Websockets - Understanding IOwinContext and WebSocketAccept
                            
                                Break point is not hitting in javascript file
                            
                                Dynamics CRM SDK in C# connect with Invalid Password
                            
                                Copy in-memory SQLite Database to make unit tests faster
                            
                                How to tell JSON.NET to deserialize JArray to a List<object> when object type is provided?
                            
                                Call oracle Stored Procedure with EF fail
                            
                                Create PDF from HTML snippet using PdfSharp having external CSS classes included in the HTML
                            
                                ASP.NET - two types of users that register in one registration form?
                            
                                Is there an interface in C# for interval-scaled values?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With