Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a node's state_ as FILLED once the node's data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see state_ as FILLED before the data within is actually ready (this seems to work; I never see corrupt data). Also, state_ is volatile so that the compiler doesn't hoist the client's check of it out of loops.
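For readers who want the publish/observe handshake spelled out, here is a minimal sketch of the same idea using C++11 std::atomic instead of volatile plus explicit barriers. This is an assumption for illustration, not the code in the pastebin: Node, payload_, publish and try_consume are hypothetical names, and placing a std::atomic in shared memory assumes it is lock-free on the target platform.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for the question's state values.
constexpr std::uint32_t NOT_FILLED_YET = 0;
constexpr std::uint32_t FILLED = 1;

// Hypothetical node layout; payload_ stands in for the real data fields.
struct Node {
    std::atomic<std::uint32_t> state_{NOT_FILLED_YET};
    std::uint32_t payload_ = 0;
};

// Server side: write the data first, then publish it with a release store,
// so the data write cannot be reordered after the state change.
inline void publish(Node& n, std::uint32_t value) {
    n.payload_ = value;
    n.state_.store(FILLED, std::memory_order_release);
}

// Client side: acquire-load state_ and read the data only once it is FILLED.
inline bool try_consume(const Node& n, std::uint32_t& out) {
    if (n.state_.load(std::memory_order_acquire) != FILLED) {
        return false;
    }
    out = n.payload_;
    return true;
}
```

The acquire load sits between reading state_ and reading the data, which is the ordering the barriers in the question are meant to enforce.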

Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?

Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':

void method_one()
{
    // Start one cursor per channel at that channel's current tail node.
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    // Poll every channel in turn, forever.
    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {
            Data* current_item = channel_cursors[i];

            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue; // nothing new on this channel yet
            }

            log_latency(current_item->tv_sec_, current_item->tv_usec_);

            // Advance this channel's cursor to the next node.
            channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}

Method 1 gave very low latency when the number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...

Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.

void method_two()
{
    // Start one cursor per channel at that channel's current tail node.
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    // Single cursor into the update list that carries the IDs of updated channels.
    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {
        ACQUIRE_MEMORY_BARRIER;
        if(update_cursor->state_ == NOT_FILLED_YET) {
            continue; // spin until the server pushes the next ID
        }

        ::uint32_t update_id = update_cursor->list_id_;

        Data* current_item = channel_cursors[update_id];

        if(current_item->state_ == NOT_FILLED_YET) {
            std::cerr << "This should never print." << std::endl; // it doesn't
            continue;
        }

        log_latency(current_item->tv_sec_, current_item->tv_usec_);

        // Advance both the channel's cursor and the update-list cursor.
        channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
        update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
    }
}

Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often give 8ms latency! Using gettimeofday, it appears that the change in update_cursor->state_ was very slow to propagate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
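The body of log_latency is not shown in the question. As a rough sketch of how a latency figure like "10us" or "8ms" could be derived from the tv_sec_ and tv_usec_ stamps with gettimeofday, assuming the server stamps each node with its own gettimeofday reading on the same machine (elapsed_usec and latency_since_usec are hypothetical helper names):

```cpp
#include <cstdint>
#include <sys/time.h>

// Microseconds elapsed between a (seconds, microseconds) stamp and `now`.
inline std::int64_t elapsed_usec(std::int64_t sent_sec, std::int64_t sent_usec,
                                 const timeval& now) {
    return (static_cast<std::int64_t>(now.tv_sec) - sent_sec) * 1000000LL
         + (static_cast<std::int64_t>(now.tv_usec) - sent_usec);
}

// Microseconds elapsed between a stamp and the current wall-clock time.
inline std::int64_t latency_since_usec(std::int64_t sent_sec, std::int64_t sent_usec) {
    timeval now;
    gettimeofday(&now, nullptr);
    return elapsed_usec(sent_sec, sent_usec, now);
}
```

Since gettimeofday reads the wall clock, any clock adjustment between the stamp and the read would skew the result, so figures obtained this way are best treated as approximate.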

Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.

void method_three()
{
    // Start one cursor per channel at that channel's current tail node.
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    // Cursor into the update list that carries the IDs of updated channels.
    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {
            // By default, check the channel the round robin has reached.
            std::size_t idx = i;

            ACQUIRE_MEMORY_BARRIER;
            if(update_cursor->state_ != NOT_FILLED_YET) {
                //std::cerr << "Found via update" << std::endl;
                // Stay on the same round-robin position next pass; when i == 0
                // the unsigned wrap is undone by the loop's ++i.
                i--;
                idx = update_cursor->list_id_;
                update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
            }

            Data* current_item = channel_cursors[idx];

            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue; // nothing new on this channel yet
            }

            log_latency(current_item->tv_sec_, current_item->tv_usec_);
            channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}

The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.

The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf

Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.

Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.

asked Mar 28 '10 01:03

Joseph Garvin

1 Answer

Hypothesis: Method 2 is somehow blocking the update from getting written by the server.

One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).

The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.

You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
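A rough sketch of that measurement, under the assumption that the state field can be modelled as a std::atomic (the question uses volatile plus barriers instead, and timed_store_ns is a hypothetical helper):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Stores `value` into `state` and returns how long the store took, in nanoseconds.
// A sequentially consistent store is used on the assumption that it makes the
// thread wait until the write is globally visible, so that any delay in gaining
// ownership of the cache line is included in the measured time.
inline std::int64_t timed_store_ns(std::atomic<std::uint32_t>& state, std::uint32_t value) {
    const auto start = std::chrono::steady_clock::now();
    state.store(value, std::memory_order_seq_cst);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

If the hypothesis holds, the server should occasionally log store times in the millisecond range that line up with the latency spikes the client sees. The timer calls themselves add some overhead, so only the large outliers are meaningful.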

You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.
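On Linux, one way to set that up is to pin both processes to the same CPU. The sketch below uses the glibc sched_getaffinity and sched_setaffinity calls; the helper name pin_to_first_allowed_cpu is hypothetical, and it assumes both processes start with the same affinity mask so that they pick the same CPU.

```cpp
#include <sched.h>

// Pins the calling thread to the first CPU it is currently allowed to run on.
// Returns that CPU's index, or -1 if the affinity could not be read or set.
inline int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            // Build a mask containing only this CPU and apply it.
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

Calling this at the start of both the server and the client should put them on the same core. Launching each program under taskset with the same CPU list is an alternative that needs no code changes.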

answered Sep 21 '22 20:09

Brooks Moses

Tags: c++, performance, multithreading, latency, shared-memory
