 

Cortex M4 LDR/STR timing

I am reading through the Cortex-M4 TRM to understand instruction execution cycles. However, some of the descriptions there are confusing:

  1. In the Table of Processor Instructions, STR takes 2 cycles.
  2. Later, in Load/store timings, it indicates that:

STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing.

If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete.

If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion

  3. Still in Load/store timings, it indicates that an LDR can be pipelined with a following LDR or STR, but an STR cannot be pipelined with the instructions that follow it:

Other instructions cannot be pipelined after STR with register offset. STR can only be pipelined when it follows an LDR, but nothing can be pipelined after the store. Even a stalled STR normally only takes two cycles, because of the write buffer

More specific on what confused me:

Q1. 1 and 2 seem to conflict with each other. How many cycles does STR actually take, 1 or 2? (My experiment shows 1, though.)

Q2. 2 indicates that if a store goes through the write buffer and the buffer is not available, it stalls the pipeline regardless, but if the store bypasses the buffer, the pipeline may only stall when load/store instructions follow. It seems as if the write buffer can only make things worse, which is contrary to common sense.

Q3. 3 means STR can't be pipelined with the following instruction, whereas 2 means STR is always pipelined with the following instruction under the proper conditions. How should I understand these conflicting statements? (And here it indicates that STR takes 2 cycles instead of 1 because of the write buffer.)

Q4. I can't find more information on how the write buffer is implemented. How large is the buffer? How does STR determine whether to use it or bypass it?

Eric Sun asked Aug 05 '20 08:08


2 Answers

Type of STR

Note that on the "Load/store timings" page the first statement refers to an STR with an immediate offset from the base address register (STR Rx,[Ry,#imm]), which I call a literal offset below. Further down it refers to an STR with a register offset from the base address register (STR R1,[R3,R2]). These are two different variants of the STR instruction.

Literal offset STR (STR Rx,[Ry,#imm])

I wonder whether the documentation is misleading when it says "always one cycle", because it then adds a caveat that means the store can cost extra cycles: "... the next instruction is delayed until the store can complete".

I am going to do my best to interpret the documentation:

STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing. If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete. If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.

I would assume that the first STR takes 1 cycle if the write buffer is available. If it is not available, the next instruction is stalled until the buffer becomes available. If the buffer is not in use at all, the next instruction is delayed until the bus transaction completes.

With a non-consecutive STR (the first STR) the write buffer is empty, and the instruction takes 1 cycle. With 2 consecutive STR instructions, the 2nd STR begins immediately once the 1st STR has written to the buffer. However, if the bus transaction for the 1st STR stalls and stays in the write buffer, the 2nd STR cannot write to the buffer and blocks further instructions. When the bus transaction for the 1st STR completes, the buffer empties and the 2nd STR writes to it, unblocking the next instruction.

A stalled bus transaction that is held in the write buffer does not affect non-STR instructions, as they do not need access to the write buffer to complete. So an STR whose bus transaction stalls will not delay further instructions unless one of them is another STR. However, if the write buffer is not in use, a stalled bus transaction delays all instructions.
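To make this interpretation concrete, here is a minimal Python sketch of a toy timing model. It is not taken from the TRM: the function name `simulate`, the `(op, wait_states)` tuple format and the rule that a buffered store holds the single buffer entry for `1 + wait_states` cycles are all assumptions chosen to mirror the reading above, in which only a later STR contends for the buffer.

```python
def simulate(program, buffer_enabled=True):
    """Count issue cycles for a list of (op, wait_states) tuples.

    Toy model (an assumption, not the documented pipeline):
    - every instruction nominally issues in 1 cycle;
    - a buffered store occupies the one-entry buffer for 1 + wait_states
      cycles after it issues, and only another store has to wait for it;
    - with the buffer disabled, a store blocks the pipeline for its
      wait states as well.
    The drain time of the final store is not counted.
    """
    cycle = 0            # cycle at which the next instruction may issue
    buffer_free_at = 0   # first cycle at which the buffer entry is free
    for op, wait_states in program:
        if op == "str" and buffer_enabled:
            # Wait for the previous buffered store to drain, then issue.
            cycle = max(cycle, buffer_free_at)
            buffer_free_at = cycle + 1 + wait_states
            cycle += 1
        elif op == "str":
            # No buffer: the store's wait states stall the pipeline.
            cycle += 1 + wait_states
        else:
            cycle += 1
    return cycle


if __name__ == "__main__":
    stalled_store = [("str", 3), ("alu", 0), ("alu", 0)]
    print(simulate(stalled_store))                        # buffered
    print(simulate(stalled_store, buffer_enabled=False))  # unbuffered
    print(simulate([("str", 3), ("str", 0)]))             # back-to-back stores
```

In this model a store with 3 wait states followed by two ALU instructions costs 3 cycles with the buffer and 6 without it, while a second store issued straight after the stalled one pushes the total to 5 cycles. That is the sense in which the buffer helps unless another store arrives before it drains.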

It does seem a bit off that the instruction set summary page puts a solid "2" as the number of cycles for STR when clearly it is not as predictable as this.

Register offset STR (STR R1,[R3,R2])

I share your confusion over the following apparently conflicting statement:

Other instructions cannot be pipelined after STR with register offset. STR can only be pipelined when it follows an LDR, but nothing can be pipelined after the store. Even a stalled STR normally only takes two cycles, because of the write buffer.

This is contradicted by the first clause on the page, but I believe that is because the page refers to 2 different STR types: literal offset (the first one) and register offset. The register offset STR is the one that cannot have instructions pipelined after it. The language could be clearer, though. What does it mean by a stalled STR? Is it referring to a register offset STR, which always stalls by default? Is this stall different from a stall caused by the write buffer being unavailable? It is easy to get lost here.

Basically, I think a register offset STR takes a minimum of 2 cycles. It blocks and takes more cycles if the write buffer is unavailable, or if the transaction is not buffered and the bus stalls.

Size of write buffer

The write buffer has a single entry, see https://developer.arm.com/documentation/100166/0001/Programmers-Model/Write-buffer?lang=en

To prevent bus wait cycles from stalling the processor during data stores, buffered stores to the DCode and System buses go through a one-entry write buffer. If the write buffer is full, subsequent accesses to the bus stall until the write buffer has drained.

The write buffer is only used if the bus waits the data phase of the buffered store, otherwise the transaction completes on the bus.

Usefulness of write buffer

As far as I understand it: if the CPU could write to a bus instantly, it would not need a buffer, as the bus would be free immediately for the next instruction. On a high-performance part like the M4, some of the memory buses can't keep up with the CPU clock rate, so a transaction can take multiple cycles. There could also be DMA units that use the same bus. To avoid stalling the CPU until a bus transaction completes, the buffer provides an immediate place to store the data, which the hardware then writes to the bus when it is free.

EmbeddedSoftwareEngineer answered Oct 16 '22 18:10


@EmbeddedSoftwareEngineer, thanks for the reply. I'd like to post what I summarized from my experiment

  1. As a baseline, LDR takes 2 cycles, STR takes 1 cycle
  2. There are 2 kinds of dependency for adjacent instructions
    • Content dependency. A typical example is an STR followed by an LDR: because nothing guarantees that the memory the LDR reads is not modified by the STR, the LDR is always delayed and takes 3 cycles.
    • Addressing dependency. When the 2nd instruction's address is based on the result of the 1st instruction, the 2nd instruction is always delayed. Typical examples:
      sub SP, SP, #20
      ldr r1, [SP, #4]
      ;OR
      ldr r3, [SP, #8]
      ldr r4, [r3]
      
      The dependent LDR always gets an extra wait cycle, which yields 3 cycles.
  3. When neither dependency described in 2 applies, an LDR following an LDR takes 1 cycle and an STR following an LDR takes 0 cycles.

All of this was measured from TCM, which introduces no extra cycles from cache loads or external bus stalls.
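The rules above can be written down as a small cycle estimator. This is only a sketch of the measured rules, not a model of the real pipeline. The function name `estimate_cycles` and the `(op, written_reg, base_reg)` tuple format are made up for illustration. Two values are assumptions that were not measured above: 1 cycle for any non-load/store instruction, and 2 cycles (baseline plus one) for an STR with an addressing dependency.

```python
def estimate_cycles(program):
    """Estimate cycles for a list of (op, written_reg, base_reg) tuples.

    written_reg is the register the instruction writes (None for STR);
    base_reg is the register used to form the address (or the source
    register for a non-memory instruction).
    """
    total = 0
    prev_op, prev_written = None, None
    for op, written_reg, base_reg in program:
        # Addressing dependency: the base register was just written.
        address_dep = prev_written is not None and prev_written == base_reg
        if op == "ldr":
            if address_dep or prev_op == "str":
                cost = 3   # rule 2: content or addressing dependency
            elif prev_op == "ldr":
                cost = 1   # rule 3: LDR pipelined after LDR
            else:
                cost = 2   # rule 1: baseline LDR
        elif op == "str":
            if address_dep:
                cost = 2   # assumption: baseline plus one wait cycle
            elif prev_op == "ldr":
                cost = 0   # rule 3: STR pipelined after LDR
            else:
                cost = 1   # rule 1: baseline STR
        else:
            cost = 1       # assumption: single-cycle ALU instruction
        total += cost
        prev_op, prev_written = op, written_reg
    return total


if __name__ == "__main__":
    # sub SP, SP, #20 ; ldr r1, [SP, #4]
    print(estimate_cycles([("sub", "sp", "sp"), ("ldr", "r1", "sp")]))
    # ldr r3, [SP, #8] ; ldr r4, [r3]
    print(estimate_cycles([("ldr", "r3", "sp"), ("ldr", "r4", "r3")]))
```

For the two example sequences from rule 2 the estimator gives 4 and 5 cycles (1 + 3 and 2 + 3), and two independent LDRs give 3 cycles (2 + 1).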

Eric Sun answered Oct 16 '22 18:10