While testing delay loops in assembly, I ran across some timing differences between a CKS32F103 and an STM32F103 based Blue Pill. This led me to run some tests across my collection of 32F103 chips.

All chips involved implement the cycle counter in the DWT (Data Watchpoint and Trace unit). We will use it to determine how many clock cycles it takes to run 1000 iterations of a delay loop. The delay loops are defined as follows:

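#include <stdint.h>

/* Runs from RAM: placed in .ramfunc, which the linker script maps to RAM. */
__attribute__( ( section(".ramfunc") ) )
void test_cycles_ram(uint32_t time_cycles) {
	/* time_cycles arrives in r0; the loop counts it down to zero. */
	asm("ramloop:" );
	asm("subs  r0, 1"  );
	asm("bhi ramloop");
}

/* The same loop, run from flash. */
void test_cycles_rom(uint32_t time_cycles) {
	asm("romloop:" );
	asm("subs  r0, 1"  );
	asm("bhi romloop");
}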

Note that I put test_cycles_ram in section .ramfunc, and put the following in my linker script:

    .data :
    {
        PROVIDE( _sdata = . );
        *(.ramfunc*);  /* Functions executed from RAM */
        *(.data*);
        PROVIDE( _edata = . );
    } > RAM AT >FLASH

so that the function is placed in RAM. Because it sits inside .data, the startup code copies the function into RAM at boot.
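For reference, the copy performed at boot can be sketched as follows. This is a minimal sketch, not the startup code used for these tests: `_sdata` and `_edata` come from the linker script above, while `_sidata` (the load address of the section in flash) and the function name `copy_data_section` are assumptions, since startup files name these differently.

```c
#include <stdint.h>

/*
 * Copy an initialised section from its load address (flash) to its run
 * address (RAM), one 32-bit word at a time. Everything the linker script
 * places in .data, including .ramfunc, is moved by this one loop.
 */
static void copy_data_section(const uint32_t *load_addr,
                              uint32_t *run_start,
                              const uint32_t *run_end)
{
    uint32_t *dst = run_start;
    while (dst < run_end) {
        *dst++ = *load_addr++;
    }
}

/*
 * Hypothetical use in a reset handler (the symbol _sidata is assumed):
 *   extern uint32_t _sidata, _sdata, _edata;
 *   copy_data_section(&_sidata, &_sdata, &_edata);
 */
```

The function must not be called before this copy has run, otherwise the CPU would execute whatever happens to be in RAM at that address.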

I read the cycle counter before and after calling these delay loops, with the parameter set to 1000 iterations. The results presented below are the number of CPU cycles divided by 1000, giving the number of cycles one iteration of the loop takes. Note that the raw measurements include some additional cycles: fetching the cycle counter, calling the function, etc. take cycles as well, and that overhead also differs between chips. However, that is out of scope for this test.
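The measurement can be sketched roughly as below. This is an illustration under assumptions, not the exact test code: the type and function names are made up here, and the registers are passed in as pointers so the logic can also be checked off-target. On a Cortex-M3, DEMCR sits at 0xE000EDFC, DWT_CTRL at 0xE0001000 and DWT_CYCCNT at 0xE0001004.

```c
#include <stdint.h>

#define DEMCR_TRCENA        (1u << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* starts the cycle counter */

/* Hypothetical helper type: pointers to the three registers involved. */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *dwt_ctrl;
    volatile uint32_t *dwt_cyccnt;
} cyccnt_regs_t;

/* Turn on the DWT, clear the counter and let it run. */
static void cyccnt_enable(const cyccnt_regs_t *r)
{
    *r->demcr |= DEMCR_TRCENA;
    *r->dwt_cyccnt = 0;
    *r->dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

/*
 * Cycles per loop iteration from two counter readings. Unsigned
 * subtraction stays correct across a single 32-bit counter wrap.
 * Integer division drops the remainder (an assumption of this sketch).
 */
static uint32_t cycles_per_iteration(uint32_t start, uint32_t end,
                                     uint32_t iterations)
{
    return (end - start) / iterations;
}

/*
 * Hypothetical use on target:
 *   uint32_t start = *regs.dwt_cyccnt;
 *   test_cycles_rom(1000);
 *   uint32_t end = *regs.dwt_cyccnt;
 *   uint32_t per_loop = cycles_per_iteration(start, end, 1000);
 */
```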

So, what do the results show? First, running at the default speed. I haven’t actually verified this to be HSI 8 MHz for all cores, but for now I’ll assume they all boot at 8 MHz on the internal oscillator. All of them, except the MM32, take 3 cycles per loop running from flash. Running from RAM at the default clock speed is slower than running from flash on every chip except the MM32, which takes 4 cycles either way. Where the STM32 takes 4 cycles per loop from RAM, the GD32 takes 9. It’s less than half the speed!

The other speeds tested run on the external oscillator (HSE), using an 8 MHz crystal, and are configured at 48 MHz with 1 wait state and 72 MHz with 2 wait states.
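As a sketch of how those wait states are selected: on the STM32F103 the LATENCY field in FLASH_ACR (bits [2:0], register at 0x40022000) is set according to the system clock. The thresholds below are those given for the STM32F103; whether each clone honours them is exactly what this test probes. The function names are made up for this illustration, and the register is passed as a pointer so the logic can be checked off-target.

```c
#include <stdint.h>

#define FLASH_ACR_LATENCY_MASK  0x7u   /* LATENCY field, bits [2:0] */

/* Wait states for a given system clock, per the STM32F103 thresholds. */
static uint32_t flash_wait_states(uint32_t sysclk_hz)
{
    if (sysclk_hz <= 24000000u) {
        return 0;   /* covers the 8 MHz boot default */
    }
    if (sysclk_hz <= 48000000u) {
        return 1;
    }
    return 2;       /* up to 72 MHz */
}

/* Write the LATENCY field, leaving the other FLASH_ACR bits untouched. */
static void flash_set_latency(volatile uint32_t *flash_acr,
                              uint32_t wait_states)
{
    uint32_t acr = *flash_acr;
    acr &= ~FLASH_ACR_LATENCY_MASK;
    acr |= wait_states & FLASH_ACR_LATENCY_MASK;
    *flash_acr = acr;
}
```

The latency should be raised before switching to the faster clock, so that flash is never read with too few wait states.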

The number of cycles when running from RAM is not affected by the speed the MCU is running at. This confirms the wait states are only applied to flash access. It varies from 4 to 9 cycles across the test subjects.
The number of cycles when running from flash is not affected by speed on the GD32, HK32, CH32, FCM32, RX32 and MH32 (the FCM32 shows a single 4-cycle reading at 48 MHz, but is back at 3 cycles at 72 MHz). So, these parts ignore the wait states. I was aware the GD32 does this, but it seems this applies to a lot of other 32F103 variants as well.
The STM32, CS32, APM32 and MM32 do appear to implement wait states. However, for the CS32 the results are the same at 48 and 72 MHz.
The GD32, CS32, HK32, CH32, FCM32, RX32 and MH32 run faster from flash than the STM32 when running at 72 MHz.
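To put the cycle counts in terms of wall-clock time, they can be divided by the clock frequency. The helper below is a small sketch for that arithmetic; its name is made up here, and it works in picoseconds to stay in integer math.

```c
#include <stdint.h>

/* Time for one loop iteration in picoseconds (remainder discarded). */
static uint64_t iteration_time_ps(uint32_t cycles, uint32_t clock_hz)
{
    return ((uint64_t)cycles * 1000000000000ull) / clock_hz;
}

/*
 * With the flash figures from the tables at 72 MHz:
 *   iteration_time_ps(6, 72000000)  STM32: about 83.3 ns
 *   iteration_time_ps(3, 72000000)  GD32:  about 41.7 ns
 */
```

By the same arithmetic, the STM32 at 48 MHz (4 cycles) also comes out at about 83.3 ns per iteration, so for this particular loop running from flash the step from 48 to 72 MHz gains it nothing.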

I might look into running some benchmarks in the future. What would be a good option? The Dhrystone benchmark? It’s as old as I am! CoreMark? Might be a bit too complicated.

 

Cycles per loop iteration   STM32F103   GD32F103   CS32F103   HK32F103
HSI 8 MHz (boot default)
  Flash                     3           3          3          3
  RAM                       4           9          6          4
HSE 48 MHz
  Flash                     4           3          4          3
  RAM                       4           9          6          4
HSE 72 MHz
  Flash                     6           3          4          3
  RAM                       4           9          6          4

 

Cycles per loop iteration   CH32F103   FCM32F103   RX32F103   AIR32F103 / MH32F103
HSI 8 MHz (boot default)
  Flash                     3          3           3          3
  RAM                       4          4           6          4
HSE 48 MHz
  Flash                     3          4           3          3
  RAM                       4          4           6          4
HSE 72 MHz
  Flash                     3          3           3          3
  RAM                       4          4           6          4

 

 

Cycles per loop iteration   APM32F103 (ApexMic)   APM32F103 (Geehy)   MM32F103
HSI 8 MHz (boot default)
  Flash                     3                     3                   4
  RAM                       4                     4                   4
HSE 48 MHz
  Flash                     4                     4                   6
  RAM                       4                     4                   4
HSE 72 MHz
  Flash                     6                     6                   7
  RAM                       4                     4                   4

Note that I tested two APM32 chips, one from 2019 when they were produced by ApexMic, and one more recent chip produced by Geehy. At least for this test they perform the same.

Also, a reader submitted this link where they did some testing of their own.