Implement a high performance mutex similar to Qt's one

Tags: c++, c++11, stl, mutex, qt

I have a multithreaded scientific application where several computing threads (one per core) have to store their results in a common buffer. This requires a mutex mechanism.

Working threads spend only a small fraction of their time writing to the buffer, so the mutex is unlocked most of the time and locks have a high probability of succeeding immediately, without waiting for another thread to unlock.

Currently, I have used Qt's QMutex for the task, and it works well: the mutex has negligible overhead.

However, I have to port the code to C++11/STL only. When using std::mutex, performance drops by 66% and the threads spend most of their time locking the mutex.

From another question, I figured out that Qt uses a fast locking mechanism based on a simple atomic flag, optimized for cases where the mutex is not already locked, and falls back to a system mutex when concurrent locking occurs.

I would like to implement this in the STL. Is there a simple way based on std::atomic and std::mutex? I have dug into Qt's code, but it seems overly complicated for my use (I do not need lock timeouts, pimpl, a small footprint, etc.).

Edit: I have tried a spinlock, but it does not work well:

Periodically (every few seconds), another thread locks the mutexes and flushes the buffer. This takes some time, so all worker threads are blocked meanwhile. With spinlocks, the waiting threads keep the scheduler busy, making the flush 10-100x slower than with a proper mutex. This is not acceptable.

Edit: I have tried this, but it does not work (it deadlocks all threads):

#include <atomic>
#include <condition_variable>
#include <mutex>

class Mutex
{
public:
    Mutex() : lockCounter(0) { }

    void lock()
    {
        if(lockCounter.fetch_add(1, std::memory_order_acquire)>0)
        {
            std::unique_lock<std::mutex> lock(internalMutex);
            cv.wait(lock);
        }
    }

    void unlock()
    {
        if(lockCounter.fetch_sub(1, std::memory_order_release)>1)
        {
            cv.notify_one();
        }
    }

private:
    std::atomic<int> lockCounter;
    std::mutex internalMutex;
    std::condition_variable cv;
};

Thanks!

Edit: Final solution

MikeMB's fast mutex was working pretty well.

As a final solution, I did:

  • Use a simple spinlock with a try_lock
  • When a thread fails to try_lock, instead of waiting, it pushes the result onto a private queue (which is not shared with other threads) and continues
  • When a thread gets the lock, it updates the buffer with the current result and also with the results stored in its queue (it drains its queue)
  • The buffer flush was made much more efficient: the blocking part only swaps two pointers (see the sketch below)
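
A minimal sketch of that scheme is shown below. It uses a try_lock-based variant of the spinlock from the answer below; the Result type, the publish/swapOutBuffer names and the vector-based buffers are hypothetical stand-ins, not the actual application code:

#include <atomic>
#include <utility>
#include <vector>

struct Result { double value; };  // hypothetical payload

// try_lock-only spinlock (same idea as the SpinLock in the answer below)
class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    bool try_lock() { return !locked.test_and_set(std::memory_order_acquire); }
    void unlock()   { locked.clear(std::memory_order_release); }
};

SpinLock bufferLock;
std::vector<Result>* sharedBuffer = new std::vector<Result>();  // guarded by bufferLock

// Called by a worker thread; 'backlog' is private to that thread.
void publish(Result r, std::vector<Result>& backlog) {
    backlog.push_back(std::move(r));
    if (bufferLock.try_lock()) {                    // never blocks the worker
        for (Result& q : backlog)                   // drain the private backlog too
            sharedBuffer->push_back(std::move(q));
        backlog.clear();
        bufferLock.unlock();
    }                                               // on failure: keep the backlog, continue computing
}

// Flusher thread: the blocking part is only a pointer swap.
std::vector<Result>* swapOutBuffer(std::vector<Result>* emptyBuffer) {
    while (!bufferLock.try_lock()) { }              // critical sections are tiny, so this spin is brief
    std::vector<Result>* full = sharedBuffer;
    sharedBuffer = emptyBuffer;
    bufferLock.unlock();
    return full;                                    // process the full buffer outside the lock
}

The point of the design is that a worker never waits: a collision only costs a push into its private backlog, and the flusher holds the lock just long enough to swap two pointers.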
asked Mar 22 '15 by galinette




1 Answer

General Advice

As was mentioned in some comments, I'd first have a look at whether you can restructure your program design to make the mutex implementation less critical for your performance.
Also, as multithreading support in standard C++ is pretty new and still somewhat immature, you sometimes just have to fall back on platform-specific mechanisms, e.g. a futex on Linux systems or critical sections on Windows, or non-standard libraries like Qt.
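
For illustration, a minimal Linux-only lock built directly on the futex system call might look like the sketch below. It follows the classic three-state scheme from Ulrich Drepper's "Futexes Are Tricky" (this example is mine, not part of the original answer, and it is of course not portable):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class FutexLock {
    // 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters
    std::atomic<int> state{0};

    static long futex(void* addr, int op, int val) {
        return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
    }
public:
    void lock() {
        int c = 0;
        if (state.compare_exchange_strong(c, 1)) return;  // fast path: 0 -> 1, no syscall
        if (c != 2) c = state.exchange(2);                // mark the lock as contended
        while (c != 0) {                                  // sleep until it looks free
            futex(&state, FUTEX_WAIT, 2);                 // blocks only while state is still 2
            c = state.exchange(2);                        // 0 -> 2 means we acquired it
        }
    }
    void unlock() {
        if (state.exchange(0) == 2)                       // there may be sleeping waiters
            futex(&state, FUTEX_WAKE, 1);                 // wake one of them
    }
};

The uncontended path is a single compare-exchange with no system call; the kernel gets involved only on an actual collision, which is exactly the behaviour QMutex approximates.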
That being said, I could think of two implementation approaches that might potentially speed up your program:

Spinlock
If access collisions happen very rarely, and the mutex is only held for short periods of time (two things one should strive for anyway, of course), it might be most efficient to just use a spinlock, as it requires no system calls at all and is simple to implement (adapted from cppreference):

#include <atomic>
#include <thread>

class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT; // must be explicitly initialized before C++20
public:
    void lock() {
        while (locked.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield(); //<- this is not in the source but might improve performance
        }
    }
    void unlock() {
        locked.clear(std::memory_order_release);
    }
};

The drawback, of course, is that waiting threads don't sleep; they keep spinning and steal processing time.
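
Since SpinLock provides lock()/unlock(), it satisfies the BasicLockable requirements and works with the standard RAII guards; a brief usage sketch (the buffer and function names are hypothetical):

#include <mutex>   // std::lock_guard
#include <vector>

SpinLock bufLock;
std::vector<double> sharedResults;  // the common buffer

void storeResult(double r) {
    std::lock_guard<SpinLock> guard(bufLock);  // spins until acquired, releases on scope exit
    sharedResults.push_back(r);                // keep the critical section as short as possible
}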

Checked Locking

This is essentially the idea you demonstrated: you first make a fast check of whether locking is actually needed, based on an atomic swap operation, and use a heavy std::mutex only if it is unavoidable.

#include <atomic>
#include <condition_variable>
#include <mutex>

struct FastMux {
    // status of the fast mutex
    std::atomic<bool> locked;
    // helper mutex and cv on which threads can wait in case of collision
    std::mutex mux;
    std::condition_variable cv;
    // the maximum number of threads that might be waiting on the cv (conservative estimate)
    std::atomic<int> cntr;

    FastMux() : locked(false), cntr(0) {}

    void lock() {
        if (locked.exchange(true)) {
            cntr++;
            {
                std::unique_lock<std::mutex> ul(mux);
                cv.wait(ul, [&]{ return !locked.exchange(true); });
            }
            cntr--;
        }
    }
    void unlock() {
        locked = false;
        if (cntr > 0) {
            std::lock_guard<std::mutex> ul(mux);
            cv.notify_one();
        }
    }
};

Note that the std::mutex is not locked between lock() and unlock(); it is only used for handling the condition variable. This results in more calls to lock / unlock if there is high contention on the mutex.

The problem with your implementation is that cv.notify_one() can potentially be called between lockCounter.fetch_add(1, std::memory_order_acquire) and cv.wait(lock), so your thread might never wake up.
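
To make the lost wakeup concrete, here is one possible interleaving of two threads using your Mutex (an illustrative trace; thread A currently holds the lock, so lockCounter is 1):

// B: lockCounter.fetch_add(1) returns 1 (>0)   // counter is now 2; B decides to wait
// A: lockCounter.fetch_sub(1) returns 2 (>1)   // counter is now 1; A decides to notify
// A: cv.notify_one();                          // nobody is waiting yet -> the signal is lost
// B: std::unique_lock<std::mutex> lock(internalMutex);
// B: cv.wait(lock);                            // waits for a notification that has already fired
//
// B now sleeps forever: cv.wait() is called without a predicate, so it
// cannot detect that the lock actually became free in the meantime.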

I didn't do any performance comparisons against a fixed version of your proposed implementation, though, so you'll just have to see what works best for you.

answered Oct 22 '22 by MikeMB