A typical spurious Larkin just-so story. Arguing from ignorance.
Branch prediction and speculative execution on these CPUs are so good
that they only mispredict on the final termination test before loop exit.
Nice in theory, wrong in practice. Try it.
I did. Theory and practice agree just fine here. The differences between
CPUs, though, even within the Core 2 family, are startling, even on the
relatively small sample I have immediately to hand.
There is absolutely no measurable difference between a count-up and a
count-down assembler loop on a Core 2 Quad, Core 2 Duo or P4. There is
enough slack with the faster CPUs to hide three immediate-operand
instructions without altering the loop timing at all (just detectable on
a 3GHz P4).
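To make the comparison concrete, here is a minimal C sketch of the two
loop directions in question; the array, element type and summing body are
placeholders of mine, not the code actually timed below.

#include <stddef.h>
#include <stdint.h>

/* Count-up version: the index runs 0..n-1.  A compiler typically
   compares the index against n at the bottom of the loop (cmp/jb). */
uint32_t sum_up(const uint32_t *a, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Count-down version: the index runs n-1..0.  A compiler can test the
   decremented counter directly (dec/jnz) and skip the compare, which is
   the usual argument for counting down. */
uint32_t sum_down(const uint32_t *a, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = n; i-- > 0; )
        acc += a[i];
    return acc;
}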
Fastest on the Core 2 Q6600 was the modern array-style indexing,
mov eax, [edi+4*ecx] etc., at 2.159 +/- 0.008 (times in seconds).
This is typical of modern compiler output for x86.
Runner-up was the cache-aware algorithm at 2.190 +/- 0.007
The obvious simple pointer-based loop gave 2.233 +/- 0.007
Cutting out word-aligned 16-bit reads gave 2.215 +/- 0.008
Loop unrolling slowed it down to 2.250
SIMD was slowest of all at 2.288 (very disappointing)
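Likewise, hedged C sketches of the pointer-based and SIMD loop shapes
from the list above (the array-style indexed form is the sum_up shape in
the previous sketch, which compilers turn into [edi+4*ecx]-style
addressing); the SSE2 intrinsics, alignment assumptions and names are
mine rather than the benchmarked code.

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 */

/* Pointer-based loop: bump the pointer instead of indexing. */
uint32_t sum_pointer(const uint32_t *a, size_t n)
{
    uint32_t acc = 0;
    for (const uint32_t *p = a; p < a + n; p++)
        acc += *p;
    return acc;
}

/* SSE2 loop: four 32-bit adds per iteration.  Assumes a is 16-byte
   aligned and n is a multiple of 4, purely to keep the sketch short. */
uint32_t sum_sse2(const uint32_t *a, size_t n)
{
    __m128i vacc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4)
        vacc = _mm_add_epi32(vacc, _mm_load_si128((const __m128i *)(a + i)));

    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, vacc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}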
The big surprise was the portable Mobile Core 2 Duo T5200 @ 1.6GHz:
SIMD was fastest at 2.356 +/- 0.020
Array-style indexing next at 2.420 +/- 0.010
Pointer loop using addition at 2.450 +/- 0.012
Pointer loop using LOOP at 3.225 +/- 0.026 !!!!
*Big* surprise: using the LOOP instruction on this older Core 2 CPU
slowed things down by a massive 0.8s. I repeated it several times as I
didn't believe it at first, but it is a solid result.
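For anyone who wants to reproduce the LOOP result, a rough sketch of the
two inner-loop forms as GCC-style inline asm (LOOP cannot be expressed in
plain C); the summing body and the names are my own placeholders, not the
loop that was actually timed.

#include <stddef.h>
#include <stdint.h>

/* Count-down loop using the LOOP instruction: LOOP decrements ECX/RCX
   and branches while it is non-zero.  It is known to be slow on many
   Intel cores, which would fit the T5200 result above.  Assumes n > 0
   and an x86/x86-64 target built with GCC or Clang. */
uint32_t sum_loop_insn(const uint32_t *a, size_t n)
{
    uint32_t acc = 0;
    __asm__ volatile(
        "1:\n\t"
        "addl -4(%[a],%[n],4), %[acc]\n\t"  /* acc += a[n-1] */
        "loop 1b\n\t"                       /* --n; branch while n != 0 */
        : [acc] "+r"(acc), [n] "+c"(n)
        : [a] "r"(a)
        : "cc", "memory");
    return acc;
}

/* The same loop with an explicit dec/jnz pair - what a compiler
   normally emits and what the faster loops above boil down to. */
uint32_t sum_dec_jnz(const uint32_t *a, size_t n)
{
    uint32_t acc = 0;
    __asm__ volatile(
        "1:\n\t"
        "addl -4(%[a],%[n],4), %[acc]\n\t"
        "dec %[n]\n\t"
        "jnz 1b\n\t"
        : [acc] "+r"(acc), [n] "+r"(n)
        : [a] "r"(a)
        : "cc", "memory");
    return acc;
}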
On older and weaker CPUs the SIMD and cache-aware code did better.
Try it. I went from 0.225 seconds to 0.207 by counting down.
You have no idea what you are talking about. It is some coincidental
change in the generated code (possibly use of the LOOP instruction to
count down).
If you knew how to examine the generated code you would have done so by
now. But instead you keep harping on about the magical properties of
Power Basic.
I grant you its code generator must be fairly good - but the timing
pattern on the latest CPUs is extremely flat. In other words, any halfway
reasonable loop construct executes much faster than the memory subsystem
can supply the data to work on. The evidence is that it is the write-back
to main memory that is causing all the problems.
The code generated is all that matters; going up or down through memory
makes no difference at all. There is more than enough slack in the loop
timing; it is utterly dominated by memory access delays.
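If the write-back really is the limit, one way to probe it is with
non-temporal (streaming) stores, which bypass the cache and the
read-for-ownership an ordinary store triggers. A minimal SSE2 sketch,
with the fill workload purely a placeholder of mine:

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 */

/* Fill a buffer with non-temporal stores.  _mm_stream_si128 writes
   through write-combining buffers straight to memory, avoiding the
   read-for-ownership and cache pollution an ordinary store incurs when
   the data will not be re-read soon.  Assumes dst is 16-byte aligned
   and n is a multiple of 4 words. */
void fill_streaming(uint32_t *dst, size_t n, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);
    _mm_sfence();   /* make the streamed stores globally visible */
}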
Regards,
Martin Brown