Timing results for STM clones
    5 1806 18 18 18 18 18 18 18 18 18 18 18

means that function number 5 needed 1806 clocks for the smallest number of iterations (which in this case is 100). Increasing the number of iterations by one increases execution time by 18 clocks. This agrees quite well with 1806 clocks for 100 iterations, since 100 * 18 = 1800 and there is a small constant overhead. Differences are printed as unsigned 16-bit numbers. Iteration counts were chosen so that the clock count should be significantly smaller than 8000. For MH32 we get lines like:

    0 1698 65459 16 16 16 16 16 16 16 16 16 16

The printed value 65459 really means that the difference was negative: read as a signed 16-bit number it is -77. This means that the first call took 93 more clocks than expected (16 + 77), which probably is the time needed to fetch code from SPI flash.

The first 4 tests access a register on the APB2 bus. After the block of 20 results obtained with the APB2 divisor equal to 1, we repeat the first 4 tests with the APB2 divisor equal to 2, and again with the APB2 divisor equal to 4. Together this gives a block of 28 lines. We repeat the measurements for various flash latencies, and print a line announcing the latency used before each block of 28 lines.

Comment: For simplicity we also run the code in RAM with varying flash latency, but since these tests do not access flash, any differences are noise.
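
The arithmetic used to read a result line can be written down as a small host-side sketch. This is only an illustration of the decoding described above, not part of the test program; the function names `signed_diff`, `overhead_clocks` and `first_call_extra` are hypothetical.

```c
#include <stdint.h>

/* Interpret a difference that was printed as an unsigned 16-bit number
   as a signed value (values of 32768 and above are negative). */
int32_t signed_diff(uint16_t printed)
{
    return printed >= 0x8000u ? (int32_t)printed - 65536 : (int32_t)printed;
}

/* Constant overhead: clocks measured at the smallest iteration count
   minus the cost of the iterations themselves. */
int32_t overhead_clocks(uint32_t base_clocks, uint32_t min_iters,
                        uint32_t clocks_per_iter)
{
    return (int32_t)base_clocks - (int32_t)(min_iters * clocks_per_iter);
}

/* Extra clocks spent by the first call: the steady per-iteration
   difference minus the first (possibly negative) difference. */
int32_t first_call_extra(uint16_t first_diff, uint16_t steady_diff)
{
    return signed_diff(steady_diff) - signed_diff(first_diff);
}
```

With the numbers quoted above, `signed_diff(65459)` gives -77, `overhead_clocks(1806, 100, 18)` gives 6, and `first_call_extra(65459, 16)` gives 93.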

    uint32_t
    test_delay(void (*fn)(uint32_t, uint32_t, uint32_t, uint32_t),
               uint32_t cnt, uint32_t addr)
    {
        uint32_t t1 = STK_CVR;    /* SysTick value before the call */
        (*fn)(cnt, addr, 0, 1);
        return t1 - STK_CVR;      /* SysTick counts down: elapsed clocks */
    }

In other words, we call each tested function via a function pointer, providing it with four arguments. The first one is the requested number of iterations, the second is the memory/peripheral address for functions testing write speed (not used by delay loops), and the last two are the values to write (used only by the optimized write functions).

Explanation of test loops
We use two different write loops: a simpler version and an optimized one that keeps all constants in registers. We have 3 kinds of addresses: one on the APB2 bus (GPIO A ODR), a second on the AHB bus (one of the DMA registers) and a third in memory. For each kind of address we try two modes: bit-band access that changes a single bit, and word write. Together this gives 2 * 3 * 2 = 12 combinations (tests 0 to 11). For APB2 accesses we also vary the clock divisor on the APB2 bus, trying divisors 1, 2 and 4. Other tests should not depend on the APB2 clock divisor, so we do them only with the APB2 divisor equal to 1.

We have 8 different delay loops. Test 12 is a simple 2-instruction, 4-byte delay loop, properly aligned to an 8-byte boundary. Test 13 is the same loop, but deliberately misaligned to cross an 8-byte boundary. Test 14 is misaligned and the loop body is padded with 3 NOPs so that it crosses 2 boundaries between 8-byte blocks. Test 15 adds one more NOP to the loop body. Test 16 is properly aligned, but the loop body has 7 NOPs, so the length of the body is 18 bytes and consequently it involves 3 consecutive blocks of 8 bytes. Test 17 has 14 NOPs in the loop body, and the length of the loop body is 32 bytes. Test 18 has 16 NOPs (length of body 36 bytes), and test 19 has 30 NOPs (length of loop body 64 bytes).
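
For readers unfamiliar with bit-band access, the sketch below shows how the alias address for a single bit is computed using the standard Cortex-M3 bit-band mapping. It is an illustration only and is not taken from the test program: the function name `bitband_alias` and the `BB_*` macros are hypothetical, and the GPIO A ODR address 0x4001080C used in the example is an assumption (it is the STM32F103 value; a clone may differ).

```c
#include <assert.h>
#include <stdint.h>

#define BB_SRAM_BASE    0x20000000u  /* start of SRAM bit-band region */
#define BB_PERIPH_BASE  0x40000000u  /* start of peripheral bit-band region */
#define BB_REGION_SIZE  0x00100000u  /* each bit-band region is 1 MB */
#define BB_ALIAS_OFFSET 0x02000000u  /* alias region = region base + 32 MB */

/* Return the alias word address for one bit of a bit-band location.
   A 32-bit write to the alias word changes only that bit.
   For a word-aligned addr, bit numbers above 7 reach into the
   following bytes of the same word. */
uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    uint32_t region = addr & 0xF0000000u;

    assert(region == BB_SRAM_BASE || region == BB_PERIPH_BASE);
    assert(addr - region < BB_REGION_SIZE);
    assert(bit < 32);

    /* Each byte of the region maps to 32 alias bytes, each bit to 4. */
    return region + BB_ALIAS_OFFSET + (addr - region) * 32u + bit * 4u;
}
```

Under the stated assumption, bit 0 of GPIO A ODR maps to `bitband_alias(0x4001080Cu, 0)`, which is 0x42210180. On the target, a bit-band write would then be a plain store of 0 or 1 through a `volatile uint32_t *` pointing at that alias address, while a word write stores the whole 32-bit value to the register address itself.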