With a single-core processor, where all your threads are run from the one single CPU, the idea of implementing a critical section using an atomic test-and-set operation on some mutex (or semaphore or etc) in memory seems straightforward enough; because your processor is executing a test-and-set from one spot in your program, it necessarily can't be doing one from another spot in your program disguised as some other thread.
But what happens when you do actually have more than one physical processor? It seems that simple instruction level atomicity wouldn't be sufficient, b/c with two processors potentially executing their test-and-set operations at the same time, what you really need to maintain atomicity on is access to the shared memory location of the mutex. (And if the shared memory location is loaded into cache, there's the whole cache consistency thing to deal with, too..)
This seems like it would incur far more overhead than the single core case, so here's the meat of the question: How much worse is it? Is it worse? Do we just live with it? Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
There are four well-known applications for MCSoC systems: (1) wireless, (2) network, (3) multimedia and (4) mobile applications.
A CPU that offers multiple cores may perform significantly better than a single-core CPU of the same speed. Multiple cores allow PCs to run multiple processes at the same time with greater ease, increasing your performance when multitasking or under the demands of powerful apps and programs.
Multi-core/SMP systems are not just several CPUs glued together. There's explicit support for doing things in parallel. All the synchronization primitives are implemented with the help of hardware along the lines of atomic CAS. The instruction either locks the bus shared by CPUs and memory controller (and devices that do DMA) and updates the memory, or just updates the memory relying on cache snooping. This in turn causes cache coherency algorithm to kick in forcing all involved parties to flush their caches.
Disclaimer - this is very basic description, there are more interesting things here like virtual vs. physical caches, cache write-back policies, memory models, fences, etc. etc.
If you want to know more about how OS might use these hardware facilities - here's an excellent book on the subject.
The vendor of multi-core cpus has to take care that the different cores coordinate themselves when executing instructions which guarantee atomic memory access.
On intel chips for instance you have the 'cmpxchg' instruction. It compares the value stored at a memory location to an expected value and exchanges it for the new value if the two match. If you precede it with the 'lock' instruction, it is guaranteed to be atomic with respect to all cores.
You would need a test-and-set that forces the processor to notify all the other cores of the operation so that they are aware. Yes, that introduces an overhead and you have to live with it. It's a reason to design multithreaded applications in such a way that they don't wait for synchronization primitives too often.
Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
That would cancel the whole point of multithreading. When you are using a lock, semaphore, or other syncronization techniques, you are relying on OS to make sure that these operations are interlocked, no matter how many cores you are using.
The time to switch to a different thread after a lock has been released is mostly determined by the cost of a context switch. This SO thread deals with the context switching overhead, so you might want to check that.
There are some other interesting threads also:
You should read this MSDN article also: Understanding the Impact of Low-Lock Techniques in Multithreaded Apps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With