Timing results for STM clones

Raw results

Format of the results

Each line starting with a number represents a series of 12 measurements of a single function with an increasing number of iterations. The first number identifies the tested function. The second number is the number of clock cycles needed for the smallest iteration count. The next 11 numbers are the differences between the current clock count and the previous one. We increase the number of iterations by one each time, so the line:
 5  1806  18  18  18  18  18  18  18  18  18  18  18
means that function number 5 needed 1806 clocks for the smallest number of iterations (which in this case is 100). Increasing the number of iterations by one increases the execution time by 18 clocks. This agrees well with 1806 clocks for 100 iterations: 100 iterations at 18 clocks each account for 1800 clocks, and the remaining 6 clocks are a small constant overhead. Differences are printed as unsigned 16-bit numbers. Iteration counts were chosen so that the clock count stays significantly below 8000. For MH32 we get lines like:
 0  1698 65459  16  16  16  16  16  16  16  16  16  16
The printed value 65459 means that the difference was negative: read as a signed 16-bit number it is -77. Since the expected difference is 16, the first call took 93 more clocks than expected, which is probably the time needed to fetch code from SPI flash.

The first 4 tests access a register on the APB2 bus. After a block of 20 results obtained with the APB2 divisor equal to 1, we repeat the first 4 tests with the APB2 divisor equal to 2 and again with the APB2 divisor equal to 4. Together this gives a block of 28 lines. We repeat the measurements for various flash latencies and print a line announcing the latency used before each block of 28 lines. Comment: for simplicity we also run the code placed in RAM with varying flash latency, but since these tests do not access flash, any differences there are noise.
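The decoding of the printed differences can be sketched in a few lines of C. This is only an illustration for post-processing the result lines on a host machine; the names decode_diff and rebuild_counts are made up for this sketch and are not part of the test program.

```c
#include <stdint.h>

/* Interpret a difference printed as an unsigned 16-bit number as signed. */
static int32_t decode_diff(uint16_t printed)
{
    return printed >= 0x8000u ? (int32_t)printed - 65536 : (int32_t)printed;
}

/* Rebuild the 12 absolute clock counts of one result line from the first
 * count and the 11 printed differences. */
static void rebuild_counts(uint32_t first, const uint16_t diffs[11],
                           uint32_t counts[12])
{
    counts[0] = first;
    for (int i = 0; i < 11; i++)
        counts[i + 1] = (uint32_t)((int32_t)counts[i] + decode_diff(diffs[i]));
}
```

For the MH32 line above, decode_diff(65459) gives -77, so the second measurement is 1698 - 77 = 1621 clocks.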

Test program

Test loops are written in assembly (basic loop, other loops). They are called from the main program by the measurement routine:
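uint32_t
test_delay(void (*fn)(uint32_t, uint32_t, uint32_t, uint32_t),
           uint32_t cnt, uint32_t addr) {
    /* The SysTick counter read via STK_CVR counts down,
       so start value minus end value gives the elapsed clocks. */
    uint32_t t1 = STK_CVR;
    (*fn)(cnt, addr, 0, 1);
    return t1 - STK_CVR;
}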

In other words, we call each tested function via a function
pointer, providing it with four arguments.  The first one is the
requested number of iterations, the second is the memory/peripheral
address for functions testing write speed (not used by delay
loops), and the last two are the values to write (used only by the
optimized write functions).
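How a series of 12 measurements turns into one result line can be sketched as follows. This is a host-side illustration under stated assumptions, not the actual main program: measure_fn is a hypothetical wrapper type (on the target it would call test_delay for one tested function), and model_clocks is a stand-in for the hardware, modelled on the example line for function 5 (1806 clocks for 100 iterations, 18 clocks per extra iteration).

```c
#include <stdint.h>

/* Hypothetical wrapper: returns the clock count for cnt iterations.
 * On the target it would call test_delay() for one tested function. */
typedef uint32_t (*measure_fn)(uint32_t cnt);

/* Host-side stand-in for the hardware (an assumption for this sketch):
 * 6 clocks of constant overhead plus 18 clocks per iteration. */
static uint32_t model_clocks(uint32_t cnt)
{
    return 6u + 18u * cnt;
}

/* Take 12 measurements with iteration counts first_cnt .. first_cnt + 11.
 * Stores the first clock count and the 11 successive differences,
 * truncated to 16 bits as in the printed results. */
static void measure_series(measure_fn measure, uint32_t first_cnt,
                           uint32_t *first, uint16_t diffs[11])
{
    uint32_t prev = measure(first_cnt);
    *first = prev;
    for (uint32_t i = 1; i < 12; i++) {
        uint32_t cur = measure(first_cnt + i);
        diffs[i - 1] = (uint16_t)(cur - prev);
        prev = cur;
    }
}
```

Calling measure_series(model_clocks, 100, &first, diffs) yields 1806 followed by eleven differences of 18. The cast to uint16_t is what makes a negative difference show up as a large value such as 65459.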

Explanation of test loops

We use two different write loops: a simpler version and an optimized one that keeps all constants in registers. We have 3 kinds of addresses: one on the APB2 bus (GPIO A ODR), a second on the AHB bus (one of the DMA registers) and a third in memory. For each kind of address we try two modes: bit-band access that changes a single bit, and word write. Together this gives 12 combinations (2 loops, 3 addresses, 2 modes). For APB2 accesses we also vary the clock divisor on the APB2 bus, trying divisors 1, 2 and 4. The other tests should not depend on the APB2 clock divisor, so we run them only with the APB2 divisor equal to 1.

We have 8 different delay loops. Test 12 is a simple 2-instruction, 4-byte delay loop properly aligned to an 8-byte boundary. Test 13 is the same loop, but deliberately misaligned so that it crosses an 8-byte boundary. Test 14 is misaligned and its loop body is padded with 3 NOPs so that it crosses 2 boundaries between 8-byte blocks. Test 15 adds one more NOP to the loop body. Test 16 is properly aligned, but its loop body has 7 NOPs, so the body is 18 bytes long and consequently involves 3 consecutive blocks of 8 bytes. Test 17 has 14 NOPs in the loop body, which is 32 bytes long. Test 18 has 16 NOPs (body length 36 bytes) and test 19 has 30 NOPs (body length 64 bytes).
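The relation between NOP count, loop body length and the number of 8-byte blocks involved can be checked with a small C sketch. The helper names are made up for this illustration, and it assumes 2-byte NOPs, which matches the body lengths quoted above (for example 4 + 2 * 7 = 18 bytes for test 16).

```c
#include <stdint.h>

/* Length in bytes of a loop body: the 4-byte basic loop plus n_nops
 * NOPs of 2 bytes each (assumed NOP size). */
static uint32_t loop_body_len(uint32_t n_nops)
{
    return 4u + 2u * n_nops;
}

/* Number of 8-byte blocks touched by len bytes starting at addr.
 * Requires len > 0. */
static uint32_t blocks_spanned(uint32_t addr, uint32_t len)
{
    return (addr + len - 1u) / 8u - addr / 8u + 1u;
}
```

For an aligned start, blocks_spanned(0, loop_body_len(7)) gives 3, in line with the 3 consecutive blocks mentioned for test 16. A 4-byte loop placed at offset 6 within a block (the only halfword-aligned offset at which it can straddle a boundary) touches 2 blocks, which is the situation of test 13.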