 

When is std::shared_timed_mutex slower than std::mutex and when (not) to use it?

I am trying to implement a multithreaded LRU cache in C++, using this article as a hint or inspiration. The article is about Go, but the required concepts more or less exist in C++ too. It proposes fine-grained locking with shared mutexes around a hash table and a linked list.

So I intended to write a cache using std::unordered_map, std::list and locking with std::shared_timed_mutex. My use case includes several threads (4-8) using this cache as storage for misspelled words and their possible corrections. The size of the cache would be around 10,000-100,000 items.
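Roughly, I had something like the following sketch in mind (C++14; eviction is omitted and reads do not yet update the LRU order):

```cpp
#include <list>
#include <shared_mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch only: misspelled word -> possible corrections.
class SpellCache {
    using Entry = std::pair<std::string, std::vector<std::string>>;

public:
    // Readers take a shared lock. Note that a full LRU get would also have to
    // move the entry to the front, which needs exclusive access to the list.
    bool get(const std::string& word, std::vector<std::string>& out) const {
        std::shared_lock<std::shared_timed_mutex> lock(mutex_);
        auto it = map_.find(word);
        if (it == map_.end())
            return false;
        out = it->second->second;
        return true;
    }

    // Writers take an exclusive lock. Eviction of the least recently used
    // entry is left out to keep the sketch short.
    void put(const std::string& word, std::vector<std::string> corrections) {
        std::unique_lock<std::shared_timed_mutex> lock(mutex_);
        auto it = map_.find(word);
        if (it != map_.end()) {
            it->second->second = std::move(corrections);
            lru_.splice(lru_.begin(), lru_, it->second);  // move to front
            return;
        }
        lru_.emplace_front(word, std::move(corrections));
        map_[word] = lru_.begin();
    }

private:
    mutable std::shared_timed_mutex mutex_;
    std::list<Entry> lru_;  // most recently used at the front
    std::unordered_map<std::string, std::list<Entry>::iterator> map_;
};
```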

But I read in several places that it rarely makes sense to use a shared mutex instead of a plain one and that it's slower, though I couldn't find any real benchmarks with numbers, or at least rough guidelines on when (not) to use a shared mutex. Other sources propose using a shared mutex whenever concurrent readers more or less outnumber concurrent writers.

  1. When is it better to use an std::shared_timed_mutex than a plain std::mutex? How many times should readers/reads outnumber writers/writes? Of course I get that it depends on many factors, but how should I make a decision which one to use?
  2. Maybe it's platform-dependent and some platform implementations are worse than others? (we use Linux and Windows as targets, MSVC 2017 and GCC 5)
  3. Does it make sense to implement cache locking as described in the article?
  4. Does std::shared_mutex (from C++17) make any difference in performance compared to a timed one?

P.S. I anticipate the answer "measure/profile first and pick what fits your case best". I would, but I need to implement one version first, and it would be great if there were some heuristics to choose by instead of implementing both options and measuring. Also, even if I do measure, the outcome will depend on the data I use, and real data can be hard to predict (e.g. for a server in the cloud).

asked Jun 21 '18 by Roman Kruglov

2 Answers

  1. When is it better to use an std::shared_timed_mutex than a plain std::mutex? How many times should readers/reads outnumber writers/writes? Of course I get that it depends on many factors, but how should I make a decision which one to use?

Because of their extra complexity, cases where read/writer locks (std::shared_mutex, std::shared_timed_mutex) are superior to a plain lock (std::mutex, std::timed_mutex) are rare. They do exist, but personally, I have never encountered one myself.

A read/writer mutex will not improve performance if you have frequent but short read operations. It is better suited for scenarios where read operations are frequent and expensive. When the read operation is only a lookup in an in-memory data structure, a simple lock will most likely outperform the read/writer solution.

If the read operations are very costly and you can process many in parallel, increasing the read vs write ratio should at some point lead to a situation where read/writer will outperform an exclusive lock. Where that breaking point is depends on the real workload. I am not aware of a good rule of thumb.
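To make the comparison concrete, here is roughly what the two locking patterns look like for a pure lookup (an illustration only, not a benchmark; the table and key type are made up):

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, int> table;

// Plain exclusive lock: readers serialize each other, but acquiring and
// releasing the mutex is cheap.
std::mutex m;
int lookup_exclusive(const std::string& key) {
    std::lock_guard<std::mutex> lock(m);
    auto it = table.find(key);
    return it != table.end() ? it->second : -1;
}

// Read/writer lock: lookups may run in parallel, but the shared lock itself
// is more expensive to take. For a lookup this short, that overhead usually
// eats the benefit of the parallelism.
std::shared_timed_mutex rw;
int lookup_shared(const std::string& key) {
    std::shared_lock<std::shared_timed_mutex> lock(rw);
    auto it = table.find(key);
    return it != table.end() ? it->second : -1;
}
```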

Also note that performing expensive operations while holding a lock is often a bad sign. There may be better ways to solve the problem than using a read/writer lock.

Two comments on the topic from people that have far more knowledge in that field than myself:

  • Howard Hinnant's answer to C++14 shared_timed_mutex VS C++11 mutex
  • Anthony Williams' quote can be found at the end of this answer (unfortunately, the link to the original post seems to be dead). He explains why read/writer locks are slow and often not the ideal solution.
  2. Maybe it's platform-dependent and some platform implementations are worse than others? (we use Linux and Windows as targets, MSVC 2017 and GCC 5)

I am not aware of significant differences between operating systems. My expectation would be that the situation will be similar. On Linux, the GCC library relies on the read/writer lock implementation of glibc. If you want to dig in, you can find the implementation in pthread_rwlock_common.c. It also illustrates the extra complexity that comes with read/writer locks.

There is an old issue for the shared_mutex implementation in Boost (#11798 - Implementation of boost::shared_mutex on POSIX is suboptimal). But it is not clear to me if the implementation can be improved, or if it is only an example that is not well suited for read/writer locks.

  3. Does it make sense to implement cache locking as described in the article?

Frankly, I am skeptical that a read/writer lock will improve performance in such a data structure. The read operations should be extremely fast, as they are only lookups. Updating the LRU list also happens outside the read operations (in the Go implementation).

One implementation detail: using linked lists is not a bad idea here, because it makes the update operations extremely fast (you just update pointers). When using std::list, keep in mind that it normally involves memory allocations, which you should avoid while you hold the lock. It is better to allocate the memory before you acquire the lock, as memory allocations are expensive.
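As a rough sketch of that last point (container names are placeholders), the list node can be built before the lock is taken and then spliced in, so no allocation for it happens inside the critical section:

```cpp
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

using Entry = std::pair<std::string, std::string>;

std::mutex m;
std::list<Entry> lru;                                          // front = most recent
std::unordered_map<std::string, std::list<Entry>::iterator> index;

void insert(std::string key, std::string value) {
    // Build the list node outside the critical section, so its allocation
    // does not happen while the lock is held.
    std::list<Entry> node;
    node.emplace_back(std::move(key), std::move(value));

    // Under the lock, splicing is just pointer updates; no allocation for the
    // list node. (The unordered_map insertion can still allocate - a complete
    // solution would address that too, e.g. by reserving buckets up front.)
    std::lock_guard<std::mutex> lock(m);
    lru.splice(lru.begin(), node);
    index[lru.front().first] = lru.begin();
}
```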

In their HHVM project, Facebook has C++ implementations of concurrent LRU caches that look promising:

  1. ConcurrentLRUCache
  2. ConcurrentScalableCache

The ConcurrentLRUCache also uses a linked list (but not std::list) for the LRU list, and tbb::concurrent_hash_map for the map itself (a concurrent hash map implementation from Intel). Note that for locking of the LRU list updates, they did not go for the read/writer approach as in the Go implementation but use a simple std::mutex exclusive lock.

The second implementation (ConcurrentScalableCache) builds on top of the ConcurrentLRUCache. They use sharding to improve scalability. The drawback is that the LRU property is only approximated (depending on how many shards you use). In some workloads that might reduce the cache hit rate, but it is a nice trick to avoid having all operations share the same lock.
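The sharding trick itself is simple. A rough sketch with made-up names (the HHVM classes differ in the details):

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <string>

// Stand-in for a single-shard cache (e.g. something like ConcurrentLRUCache)
// that has its own internal lock.
class LRUShard { /* ... */ };

// Keys are routed to shards by hash, so threads working on different shards
// never contend on the same lock. The price: LRU order is only maintained
// per shard, i.e. it is approximate for the cache as a whole.
template <std::size_t N>
class ShardedLRUCache {
public:
    LRUShard& shard_for(const std::string& key) {
        return shards_[std::hash<std::string>{}(key) % N];
    }

private:
    std::array<LRUShard, N> shards_;
};
```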

  4. Does std::shared_mutex (from C++17) make any difference in performance compared to a timed one?

I do not have benchmark numbers about the overhead, but it looks like comparing apples and oranges. If you need the timing feature, you have no real choice but to use std::shared_timed_mutex. But if you do not need it, you can simply use std::shared_mutex, which has to do less work and thus should never be slower.
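The timing feature means the timed acquisition functions such as try_lock_for and try_lock_shared_for, which only std::shared_timed_mutex provides. A small illustration:

```cpp
#include <chrono>
#include <shared_mutex>

std::shared_timed_mutex m;

bool try_read_with_timeout() {
    using namespace std::chrono_literals;
    // Give up if the shared lock cannot be obtained within 10 ms.
    if (m.try_lock_shared_for(10ms)) {
        // ... read the shared data ...
        m.unlock_shared();
        return true;
    }
    return false;
}
```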

I would not expect the timing overhead to be too serious for the typical scenarios where you need timeouts, as the locks tend to be held longer in those cases anyway. But as said, I cannot back that statement with real measurements.

answered by Philipp Claßen


So, which problems can actually be solved by std::shared_mutex?

Let's imagine you are writing some real-time audio software. You have a callback which is called by the driver 1000 times per second, and you have to put 1 ms of audio data into its buffer so the hardware can play it during the next 1 ms. And you have a "big" buffer of audio data (let's say 10 seconds) which is rendered by some other thread in the background and rewritten once every 10 seconds. Also, you have 10 more threads which want to read data from the same buffer (to draw something on the UI, send it over the network, control external lights and so on). These are real tasks of real DJ software, not a joke.

So, at every callback call (every 1 ms) you have a very low chance of a conflict with the writer thread (0.01%), but a nearly 100% chance of a conflict with another reader thread - they work all the time and read the same buffer! Now, let's say some thread which reads data from this buffer locks the std::mutex, decides to send something over the network and then waits 500 ms for the response - you'll be blocked, unable to do anything, your hardware will not get its next portion of sound and will play silence (imagine this during a concert, for example). This is a disaster.

But here is the solution - use std::shared_mutex and std::shared_lock for all reader threads. Yes, an average std::shared_lock acquisition will cost you more (say 100 nanoseconds instead of 50 - still very cheap even for a real-time app that has at most 1 ms to fill the buffer), but you are 100% safe from the worst case, where another reader thread blocks your performance-critical thread for 500 ms.
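In code, the pattern looks roughly like this (a sketch; the buffer type, function names and sizes are made up, and std::shared_timed_mutex works the same way pre-C++17):

```cpp
#include <algorithm>
#include <cstddef>
#include <shared_mutex>
#include <vector>

std::shared_mutex buffer_mutex;
std::vector<float> big_buffer;            // ~10 seconds of rendered audio

// Background renderer, runs roughly once every 10 seconds. It prepares the
// data without holding any lock and takes the exclusive lock only for a
// cheap swap.
void publish_rendered(std::vector<float> fresh) {
    std::unique_lock<std::shared_mutex> lock(buffer_mutex);
    big_buffer.swap(fresh);
}

// The audio callback and the other reader threads (UI, network, lights, ...)
// all take shared locks, so a slow reader can never block the callback.
// Bounds checking is omitted for brevity.
void read_chunk(std::size_t offset, std::size_t count, float* out) {
    std::shared_lock<std::shared_mutex> lock(buffer_mutex);
    std::copy_n(big_buffer.begin() + offset, count, out);
}
```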

And that's the reason to use std::shared_mutex - to avoid/improve worst cases, not to improve average performance (that should be achieved in other ways).

answered by Ezh