How do you delay for a millisecond on a microcontroller? We could, of course, use a timer peripheral specific to our MCU, but I’m looking for a more generic solution.
On a Cortex-M3 (or better) I’d check for the DWT and, if fitted, check whether it implements the cycle counter. On a Cortex-M0 (or M0+), however, we have no such counter.
Therefore I’m looking into a simple loop that counts cycles. Because I want to count cycles, I’ll have to write the loop in assembly to ensure the same duration regardless of the compiler optimisation flags.

When running on a Cortex-M3, we’d have to check for the presence of the cycle counter. While all Cortex-M3 and M4 MCUs I have encountered implement it, one should be aware that it is an optional component. Therefore I’ll write a cycle delay implementation in both Thumb and Thumb2 assembly. Furthermore, I’ll execute this code from RAM, to avoid flash delays.
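
For reference, the cycle counter check could look roughly like the C sketch below. The register addresses and bit positions (DEMCR.TRCENA, DWT_CTRL.NOCYCCNT, DWT_CTRL.CYCCNTENA) are the architecturally defined ARMv7-M ones; the type and function names (dwt_regs_t, dwt_cyccnt_start, dwt_delay_cycles, DWT_HW_REGS) are made up for this example. The registers are passed in as pointers purely so the logic can also be exercised off-target. This only tests the NOCYCCNT flag and does not probe whether a DWT unit is fitted at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architecturally defined ARMv7-M bits. */
#define DEMCR_TRCENA        (1u << 24)  /* global enable for DWT and ITM     */
#define DWT_CTRL_NOCYCCNT   (1u << 25)  /* set when CYCCNT is NOT implemented */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* enables the cycle counter          */

/* Hypothetical helper type: pointers to the three registers we touch. */
typedef struct {
    volatile uint32_t *demcr;   /* 0xE000EDFC on real hardware */
    volatile uint32_t *ctrl;    /* 0xE0001000 (DWT_CTRL)       */
    volatile uint32_t *cyccnt;  /* 0xE0001004 (DWT_CYCCNT)     */
} dwt_regs_t;

/* Initialiser for real ARMv7-M hardware. */
#define DWT_HW_REGS { (volatile uint32_t *)0xE000EDFCu, \
                      (volatile uint32_t *)0xE0001000u, \
                      (volatile uint32_t *)0xE0001004u }

/* Enables the cycle counter; returns false if it is not implemented. */
bool dwt_cyccnt_start(const dwt_regs_t *r)
{
    assert(r && r->demcr && r->ctrl && r->cyccnt);
    *r->demcr |= DEMCR_TRCENA;
    if (*r->ctrl & DWT_CTRL_NOCYCCNT)
        return false;
    *r->cyccnt = 0;
    *r->ctrl |= DWT_CTRL_CYCCNTENA;
    return true;
}

/* Busy-waits until at least `cycles` cycles have elapsed.
 * The unsigned subtraction keeps working across a counter wrap. */
void dwt_delay_cycles(const dwt_regs_t *r, uint32_t cycles)
{
    uint32_t start = *r->cyccnt;
    while ((uint32_t)(*r->cyccnt - start) < cycles) {
        /* spin */
    }
}
```

On target this would be used as `const dwt_regs_t hw = DWT_HW_REGS; if (dwt_cyccnt_start(&hw)) dwt_delay_cycles(&hw, n);`, with the assembly loop as the fallback when it returns false.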

Thumb / ARMv6-M

// On an STM32F072 (Cortex-M0)
loop:
    subs r0, r0, #4 // 1 cycle, sets the flags tested by bhi
    bhi  loop       // 3 cycles when taken

When I run this code on an STM32L051 it runs slightly fast, so I adjusted the code to:

// On an STM32L051 (Cortex-M0+)
loop:
    subs r0, r0, #3 // 1 cycle, sets the flags tested by bhi
    bhi  loop       // 2 cycles when taken

This difference is due to the fact that a taken branch takes 3 cycles on a Cortex-M0 and 2 cycles on a Cortex-M0+, which matches the specifications.
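
Because the loop subtracts exactly as many counts as one iteration costs in cycles, r0 can be loaded directly with the number of cycles to wait. The C sketch below shows that conversion, plus a small host-side model of the loop that I find handy for sanity-checking the numbers. The function names and the 48 MHz clock in the usage note are assumptions for illustration, and the model ignores any call and return overhead around the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Converts a delay in milliseconds to core clock cycles. */
uint32_t ms_to_cycles(uint32_t ms, uint32_t core_hz)
{
    uint64_t cycles = (uint64_t)ms * core_hz / 1000u;
    assert(cycles <= UINT32_MAX);   /* must fit in r0 */
    return (uint32_t)cycles;
}

/* Host-side model of "subs r0, #step; bhi loop".
 * Returns the number of cycles the loop itself consumes. */
uint32_t model_loop_cycles(uint32_t r0, uint32_t step,
                           uint32_t taken_branch_cycles)
{
    uint32_t cycles = 0;
    for (;;) {
        /* After SUBS, BHI is taken iff r0 was (unsigned) higher than step. */
        bool taken = r0 > step;
        r0 -= step;
        cycles += 1u;                       /* SUBS */
        if (!taken) {
            cycles += 1u;                   /* branch not taken: 1 cycle */
            return cycles;
        }
        cycles += taken_branch_cycles;      /* 3 on M0, 2 on M0+ */
    }
}
```

For example, with an assumed 48 MHz core clock, `ms_to_cycles(1, 48000000)` gives 48000, and `model_loop_cycles(48000, 4, 3)` reports 47998 cycles for the Cortex-M0 loop: the final, not-taken branch costs 1 cycle instead of 3, so the loop ends two cycles early.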

Thumb2 / ARMv7-M

Since an M3 usually has a cycle counter, this way of delaying shouldn’t be needed much there. I wrote it anyway, and found some interesting results.

// On an STM32F103
loop:
    subs  r0, #4 // 1 cycle
    bhi   loop   // 1 + P cycles; P appears to be 2

// On a CKS32F103
loop:
    subs  r0, #6 // 1 cycle
    bhi   loop   // 1 + P cycles; P appears to be 4 (?)

The speed of execution differs between an STM32F103 and a CKS32F103. I am running the code from RAM, so there shouldn’t be any delays fetching instructions.

While on the Cortex-M0(+) a taken conditional branch costs a fixed number of cycles, on the Cortex-M3 it takes 1 + P cycles, where P is the number of cycles required for a pipeline refill. This ranges from 1 to 3 depending on the alignment and width of the target instruction, and whether the processor manages to speculate the address early. The alignment and the target instruction are identical on both chips, as we are running the same code. Could there be a difference in the speculation?

One thing to consider is that the two chips implement different revisions of the Cortex-M3. Still, something is off with the CKS32: from the numbers I got, P is 2 on the STM32, which is within the expected range, but 4 on the CKS32, which is out of range. I am running from RAM, so flash wait states should not delay my code. Is there perhaps a bug in the CKS32 that makes it apply wait states to all instruction fetches, rather than only to those from flash? To test that, I could run at a lower clock speed that does not require wait states and see whether the difference remains.
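
To make the arithmetic behind those P values explicit: one iteration costs 1 cycle for the subs plus 1 + P cycles for the taken bhi, so P is the tuned decrement minus 2. A tiny C sketch with made-up function names; it assumes the subs really costs 1 cycle and that all remaining time belongs to the branch, which is exactly the assumption the CKS32 result calls into question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* loop_cycles = 1 (SUBS) + 1 + P (taken BHI)  =>  P = loop_cycles - 2 */
uint32_t refill_cycles_from_loop(uint32_t loop_cycles)
{
    assert(loop_cycles >= 2u);
    return loop_cycles - 2u;
}

/* The Cortex-M3 documentation gives P as 1 to 3 cycles. */
bool refill_in_documented_range(uint32_t p)
{
    return p >= 1u && p <= 3u;
}
```

With the decrements found above, `refill_cycles_from_loop(4)` yields 2 for the STM32F103 (in range) and `refill_cycles_from_loop(6)` yields 4 for the CKS32F103 (out of range).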

Whatever is going on, it surely is interesting to see these differences between the two chips. I suppose I should run tests on all the 32F103 clones I have. But this is definitely something that could cause compatibility issues in timing-critical code.