Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Simulating LDREX/STREX (load/store exclusive) in Cortex-M0

In the Cortex-M3 instruction set, there exist a family of LDREX/STREX instructions such that if a location is read with an LDREX instruction, a following STREX instruction can write to that address only if the address is known to have been untouched. Typically, the effect is that the STREX will succeed if no interrupts ("exceptions" in ARM parlance) have occurred since the LDREX, but fail otherwise.

What's the most practical way to simulate such behavior in the Cortex M0? I would like to write C code for the M3 and have it portable to the M0. On the M3, one can say something like:

__inline void do_inc(unsigned int *dat)
{
  while(__strex(__ldrex(dat)+1,dat)) {}
}

to perform an atomic increment. The only ways I can think of to achieve similar functionality on the Cortex-M0 would be to either:

  1. Have "ldrex" disable exceptions and have "strex" and "clrex" re-enable them, with the requirement that every "ldrex" must be followed soon thereafter by either a "strex" or "clrex".
  2. Have "ldrex", "strex", and "clrex" be a very small routines in RAM, with one instruction of "ldrex" being patched to either "str r1,[r2]" or "mov r0,#1". Have the "ldrex" routine plug a "str" instruction into the "strex" routine, and have the "clrex" routine plug "mov r0,#1" there. Have all exceptions that might invalidate a "ldrex" sequence call "clrex".

Depending upon how the ldrex/strex functions are used, disabling interrupts might work reasonably, but it seems icky to change the semantics of "load-exclusive" so as to cause bad side-effects if it's abandoned. The code-patching idea seems like it would achieve the desired semantics, but it seems clunky.

(BTW, side question: I wonder why STREX on the M3 stores the success/failure indication to a register rather than simply setting a flag? Its actual operation requires four extra bits in the opcode, requires that a register be available to hold the success/failure indication, and requires that a "cmp r0,#0" be used to determine if it succeeded. Was it expected that compilers wouldn't be able to handle a STREX intrinsic sensibly if they didn't get the result in a register? Getting Carry into a register takes two short instructions.)

like image 895
supercat Avatar asked Apr 21 '11 14:04

supercat


2 Answers

The Cortex-M3 was designed to heavy low-latency and low-jitter multitasking, i.e. it's interrupt controller cooperates with the core in order to keep guarantees on number of cycles since interrupt triggering to interrupt handling. The ldrex/strex was implemented as a way to cooperate with all that (by all that I mean interrupt masking and other details such as atomic bit setting via bitband aliases), as otherwise, a single core, non-MMU, non-cache µC would have little use for it. If it didn't implement it though, a low priority task would have to hold a lock and that could generate small priority inversions, with latency and jitter which a hard real time system can't cope with, at least not within the order of magnitude allowed by the "retry" semantics that a failed ldrex/ strex has.

On a side note, and speaking strictly in terms of timings and jitter, the Cortex-M0 has a more traditional interrupt timing profile (i.e. it will not abort instructions on the core when an interrupt arrive), being subject to way more jitter and latency. On this matter (again, strictly timing), it's more comparable to older models (i.e. the arm7tdmi), which also lacks atomic load/modify/store as well as atomic increments & decrements and other low-latency cooperative instructions, requiring interrupt disable/enable more often.

I use something like this in Cortex-M3:

#define unlikely(x) __builtin_expect((long)(x),0)
    static inline int atomic_LL(volatile void *addr) {
      int dest;

  __asm__ __volatile__("ldrex %0, [%1]" : "=r" (dest) : "r" (addr));
  return dest;
}

static inline int atomic_SC(volatile void *addr, int32_t value) {
  int dest;

  __asm__ __volatile__("strex %0, %2, [%1]" :
          "=&r" (dest) : "r" (addr), "r" (value) : "memory");
  return dest;
}

/**
 * atomic Compare And Swap
 * @param addr Address
 * @param expected Expected value in *addr
 * @param store Value to be stored, if (*addr == expected).
 * @return 0  ok, 1 failure.
 */
static inline int atomic_CAS(volatile void *addr, int32_t expected,
        int32_t store) {
  int ret;

  do {
    if (unlikely(atomic_LL(addr) != expected))
      return 1;
  } while (unlikely((ret = atomic_SC(addr, store))));
  return ret;

}

In other words, it takes ldrex/strex into well-known Linked Load and Store Conditional, and with it it also implements the Compare and Swap semantics.

If your code does fine with only compare-and-swap, you can implement it for cortex-m0 like this:

static inline int atomic_CAS(volatile void *addr, int32_t expected,
        int32_t store) {
  int ret = 1;

   __interrupt_disable();
   if (*(volatile uint32_t *)addr) == expected) {
      *addr = store;
      ret = 0;
   }
   __interrupt_enable();
   return ret;
}

That's the most used pattern because some architectures originally only had it (x86 comes to mind).

Implementing an emulation of LL/SC pattern by CAS seems ugly from where I stand. Specially when the SC is more than a few instructions apart from LL, but although very common, ARM doesn't recommend it specially in the Cortex-M3 case because as any interrupts will make strex fail, if you start to taking too long between ldrex/strex your code will spend a lot of time in a loop retrying strex, which could be interpreted as abusing the pattern and defeating it's own purpose.

As for your side question, in the cortex-m3 case the strex return in a register because the semantics were already defined by higher-level architectures (strex/ldrex exists in multi-core arms that were implemented before armv7-m was defined, and after it, where the cache controllers actually check for ldrex/strex addresses, i.e. strex only fails when the cache controller can't prove the dataline the load/store touches was unmodified).

If I were to speculate, I'd say the original design have this semantic because in early days this kind of atomics were designed thinking in libraries: you'd return success/failure in functions implemented in assembler and this would need to respect the ABI and most of them (all I know off) uses a register or stack, and not the flags, to return values.

Also, compilers are better in using register coloring than to clobbering the flags in case some other instruction uses it, i.e. consider a complex operation which generates flags and in the mid of it you have a ldrex/strex sequence, and the operation that comes afterwards needs the flags: the compiler would have to move the flags to a register, requiring extra instruction(s) anyway.

like image 119
Alexandre Pereira Nunes Avatar answered Oct 13 '22 16:10

Alexandre Pereira Nunes


Well... you still have SWP remaining, but it's a less powerful atomic instruction.

Interrupt disabling is sure to work though. :-)

Edit:

No SWP on -m0, sorry supercat.

OK, seems you're only left with interrupt disabling. You can use gcc-compilable inline asm as a guide how to disable and properly restore it: http://repo.or.cz/w/cbaos.git/blob/HEAD:/arch/arm-cortex-m0/include/lock.h

like image 38
domen Avatar answered Oct 13 '22 16:10

domen