I'm using boost spsc_queue to move data from one thread to another. It's one of the latency-critical places in my software, so I want it to be as fast as possible. I wrote this test program:
#include <boost/lockfree/spsc_queue.hpp>
#include <stdint.h>
#include <unistd.h>  // usleep
#include <chrono>
#include <cstdio>
#include <thread>

const int N_TESTS = 1000;
int results[N_TESTS];

boost::lockfree::spsc_queue<int64_t, boost::lockfree::capacity<1024>> testQueue;

using std::chrono::nanoseconds;
using std::chrono::duration_cast;

int totalQueueNano(0);
int totalQueueCount(0);

void Consumer() {
    int i = 0;
    int64_t scheduledAt;
    while (i < N_TESTS - 1) {
        while (testQueue.pop(scheduledAt)) {
            int64_t dequeuedAt = (duration_cast<nanoseconds>(
                std::chrono::high_resolution_clock::now().time_since_epoch())).count();
            auto diff = dequeuedAt - scheduledAt;
            totalQueueNano += diff;
            ++totalQueueCount;
            results[i] = diff;
            ++i;
        }
    }
    for (int i = 0; i < N_TESTS; i++) {
        printf("%d ", results[i]);
    }
    printf("\nspsc_queue latency average nano = %d\n", totalQueueNano / totalQueueCount);
}

int main() {
    std::thread t(Consumer);
    usleep(1000000);
    for (int i = 0; i < N_TESTS; i++) {
        usleep(1000);
        int64_t scheduledAt = (duration_cast<nanoseconds>(
            std::chrono::high_resolution_clock::now().time_since_epoch())).count();
        testQueue.push(scheduledAt);
    }
    usleep(1000000);
    return 0;
}
Compile flags:
g++ -std=c++0x -O3 -Wall -c -fmessage-length=0 -march=native -mtune=native -pthread -MMD -MP -MF"src/TestProject.d" -MT"src/TestProject.d" -o "src/TestProject.o" "../src/TestProject.cpp"
g++ -pthread -o "TestProject" ./src/TestProject.o -lpthread
On my machine (RHEL 7.1, gcc 4.8.3, Xeon E5-2690 v3) I measure 290-300 nanoseconds.
upd: A queue mechanism is required. If the first thread produces data every 1000 nanoseconds but the second thread spends 10 000 nanoseconds processing a single item, I need to "queue" several items for a short period of time. But my "queue" never grows too big; a fixed-size, short ring buffer should be enough.
upd2: So, in short, the question is: what is the fastest single-producer single-consumer queue (most likely based on a fixed-size ring buffer)? I'm using boost spsc_queue and I achieve ~300 ns latency; can you suggest something faster?
upd3: In the Java world there is the Disruptor, which achieves 50 ns latency: https://code.google.com/p/disruptor/wiki/PerformanceResults Do we have something in C++ with the same 50 ns latency?
Since you push plain int64_t values, what you (ideally) measure above is the overall latency between a call to push() and the time pop() returns true.
This doesn't make sense: the consumer thread is busily polling the queue, that is, it loops and repeatedly checks whether pop has fetched a value.
If (IFF) you want to minimize latency (for a single item), my guess would be to use a signaling synchronization mechanism; spsc_queue, as far as I can tell, does not provide for this. (You'd need a container or custom solution where you employ some kind of condition variable / event, ...)
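The kind of signaling wrapper hinted at here could be sketched as below. This is a hypothetical minimal example (the class name and layout are my own, not part of boost): the consumer blocks on a condition variable instead of spinning. Note the trade-off: a condition-variable wakeup typically costs on the order of microseconds, so this spares a CPU core rather than beating a busy-poll on raw latency.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Illustrative blocking SPSC channel: the consumer sleeps until the
// producer signals that an item is available.
template <typename T>
class BlockingQueue {
public:
    void push(const T& value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(value);
        }
        cv_.notify_one();  // wake the (single) consumer
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        T value = items_.front();
        items_.pop_front();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<T> items_;
};
```

The unbounded std::deque is only for brevity; a real low-latency variant would use a fixed-size buffer.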
If (IFF), however, you want to maximize throughput (items per unit time), then measuring the latency for a "wakeup" of a (single) item makes even less sense. In that case you want to make the best use of the parallelism you have, as is mentioned in a comment:
Often the fastest way to pass data is to use a single thread for each chunk of data. That is to say, use only the parallelism present in the data.
Addressing your bullet points:
- How good is the test app: I do not think it makes much sense.
- Putting scheduledAt in an atomic is required if you share it directly between threads, as you write it from one thread and read it from another; otherwise you have UB. Alternatively, push a struct { int val; int64_t time; }; into the queue, thereby avoiding the atomic fence.
- Current industry best time: no clue. Not sure anyone cares about this. (Maybe inside some kernel stuff?)
- Choice of spsc_queue: I don't think it is a good choice because it requires polling.
- Faster than spsc_queue?: See above. Use non-polling notification.
- Write code which does the same work significantly faster?: No. Or rather, I won't. =>
To quote "man"'s answer:
- you define the problem and select an appropriate synchronization mechanism
The problem with your question is that there is no problem definition.
As far as I am concerned so far, in the context of a user-land process on a regular OS, cross thread notification latency seems utterly irrelevant. What is your use case?
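For context, the fixed-size ring buffer that such SPSC queues are built on can be sketched roughly as follows. This is an illustrative single-producer/single-consumer design under the usual assumptions (exactly one producer thread and one consumer thread, power-of-two capacity, one slot kept empty to distinguish "full" from "empty"); it is not boost's actual implementation.

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer: head_ is written only by the producer,
// tail_ only by the consumer; acquire/release ordering publishes the data.
template <typename T, size_t Capacity>
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    bool push(const T& value) {
        const size_t head = head_.load(std::memory_order_relaxed);
        const size_t next = (head + 1) & (Capacity - 1);
        if (next == tail_.load(std::memory_order_acquire))
            return false;  // full (one slot deliberately unused)
        buffer_[head] = value;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;  // empty
        out = buffer_[tail];
        tail_.store((tail + 1) & (Capacity - 1), std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    std::atomic<size_t> head_{0};  // written by producer only
    std::atomic<size_t> tail_{0};  // written by consumer only
};
```

An element type like the struct { int val; int64_t time; } suggested in the bullet list above works equally well here, carrying the timestamp together with the payload.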
First of all, writing such a test program is completely useless: you don't do any work with the data, so the results are skewed. Second, your test uses usleep() between pushes; at that rate you can use any kind of synchronization primitive. Also note that main() never joins (or detaches) the Consumer thread, so std::terminate is called when main returns.
The way you implement such a thing is the following: you define the problem and select an appropriate synchronization mechanism. You need some previous experience for the first step, or you can try implementing different approaches and see what works best.
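To address the "you don't do any work with the data" point above, one option is to run deterministic per-item work on each dequeued value and use the accumulated result after the benchmark (e.g. print it), so the compiler cannot delete the consumer loop. A minimal sketch; the constants are just an arbitrary LCG step, purely illustrative:

```cpp
#include <cstdint>

// Stand-in for real per-item work: a chain of dependent integer
// operations. The caller must consume the return value (sum it,
// print it) or the optimizer may still discard the calls.
uint64_t process(uint64_t item) {
    uint64_t acc = item;
    for (int i = 0; i < 16; ++i)
        acc = acc * 6364136223846793005ULL + 1442695040888963407ULL;
    return acc;
}
```

Unsigned arithmetic is used deliberately: signed overflow would be undefined behavior, while uint64_t wraps around as required here.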