Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

atomic memcpy suggestion


While testing a program for scalability, I came across the situation where I have to make my memcpy operation as atomic operation . I have to copy 64bytes of data from one location to other .
I came across one solution, that is using spinning over a variable is :

struct record{
    volatile int startFlag;
    char data[64];
    volatile int doneFlag;
};

and pseudo code follows

struct record *node;
if ( node->startFlag ==0 ) {  // testing the flag 
    if( CompareAndSwap(node->startFlag , 0 ,1 ) ) {  // all thread tries to set, only one will get success and perform memcpy operation 
        memcpy(destination,source,NoOfBytes);
        node->doneFlag = 1; // spinning variable for other thread, those failed in CompAndSwap 
    }
    else {
         while ( node->doneFlag==0 ) { // other thread spinning 
          ; // spin around and/or use back-off policy  
         }
   }}

Can this perform as atomic memcpy ? Though if thread performing memcpy gets preempted ( before or after memcpy but before setting doneFlag ), then others will keep spinning. Or what can be done to make this atomic .
Situation is like other thread must have to wait unless data get copied, since they have to compare with inserted data, with their own data .
I am using test-and-test-and-set approach in case of startFlag to reduce some costly atomic operation. Spin-locks are also Scalable, but i have measured that atomic calls give better performance than spin-lock, moreover i am looking for the problems that can arise in this snippet. And since i am using my own memory-manager, so memory allocation and free calls are costly for me, so using another buffer and copy content in it, then setting pointer ( since pointer size is under atomic operation) is costly, since it will require many mem-alloc and mem-free calls.

EDIT I am not using mutex, because they doesn't seems to be scalable moreover this is just a part of program, so critical section is not this small ( i understand that for larger critical section it is hard to use atomic operations ).

like image 371
peeyush Avatar asked Jul 15 '11 08:07

peeyush


3 Answers

Your code snippet is definitely broken. There's a race on node->startFlag

Unfortunately, there's no atomic way to copy 64 bytes. I think you have number of options here.

  1. Access node->startFlag in atomic fashion. I've written a couple of posts on the subject: here and here.
  2. Protect entire thing with user-mode spinlock. Here's a post on the subject
  3. Use RCU like approach. You can read about RCU here. In two words, the idea is to reference the buffer you want to copy using a pointer. Then you do:
    1. Allocate new buffer.
    2. Create it's contents (memcpy from your source).
    3. Atomically substitute the buffer with new one.
    4. Wait for all threads accessing old buffer to expire and free it.

Hope it helps. Alex.

like image 82
Alexander Sandler Avatar answered Nov 12 '22 16:11

Alexander Sandler


It's late but just for others arriving to this question, the following is easier faster and put less pressure on the cache.

Note: I changed CAS to the corresponding atomic builtin in GCC. There is no need for "volatile", CAS introduces a memory barrier.

// Simpler structure
struct record {
    int spin = 0;
    char data[64];
};



struct record *node;

while (node->spin || ! __sync_bool_compare_and_swap(&node->spin , 0 , 1)); // spin
memcpy(destination,source,NoOfBytes);
node->spin = 0; 

PS: I'm not sure if a CAS instead of node->spin = 0 could improve efficiency a little more.

like image 25
gallir Avatar answered Nov 12 '22 15:11

gallir


Dont use a lock, use a CriticalSection. Locks are heavy-weight, CriticalSections are extremely, extremely fast (just a couple instructions depending on platform). You did not specify an operating system and the info i post here is experienced in Windows, though other OS's should be similar.

You had some concern that CriticalSections might not be scalable enough for your purpose if they contain a lot of code? The underlying reason (and probably the argument of where you read that) is, that the CriticalSection cannot interleave in multiple threads quite as fine-grained if the threads hold on to the CS for a long time. You can avoid that by just wrapping the CS around only that part of your code that really needs to be atomic. On the other hand: If you use CS too fine-grained the percentual overhead will of course increase. This trade-off you can not avoid with any kind of synchronization.

You say that the atomic operation you need is a copy of 64 Bytes: In that case your synchronization-overhead with CS's will be negligible. Just try it. With the granularity at which you synchronize (around a single copy of 64 bytes or around say 4 of these copies) you can balance thread-interleaving granularity vs. percentual overhead by doing some experimentation. But in general: CS's are fast enough and scalable enough.

like image 3
Bernd Elkemann Avatar answered Nov 12 '22 16:11

Bernd Elkemann