Maker Pro
Maker Pro

Slow vs Fast Multi Dimensional Array Access Index Order.

S

Skybuck Flying

Hello,

When accessing a multi dimensional array it can be done in different index
orders for example:

// slow:
for x:=0 to 199 do
for y:= 0 to 199 do
for z:= 0 to 199 do
array[z,y,x] := ...;

// 4x times faster:
for x:=0 to 199 do
for y:= 0 to 199 do
for z:= 0 to 199 do
array[x,y,z] := ...;

Test system is AMD X2 3800+ with Windows XP x64 Pro Edition, Code compiled
with Borland Delphi 2006 Win32.

I read about different row and column access times for memory chips, and
some time ago somebody mentioned communication between cpu and processor is
faster for sequential memory access...I thought it was fun to write a speed
test program to investigate this. I wrote a test program to try out all
index order combinations for four and three dimensional arrays and there is
indeed a substantial timing difference.

More information on this would be nice...

Bye,
Skybuck.
 
A

Ancient_Hacker

A couple issues with your tests:

(1) Delphi is a wonderful development system, but the compiler doesnt
do all that much optimization.

(2) If you switch to a heavitly optimizing compiler, such as Microsoft
C, the compiler might notice you never use any of the values you store,
and it will even optimize out the store loop!

(3) CPU cache size will have a huge effect on the loop speed. if you
access the array in the order the stuff is stored in memory you can get
a 4x speedup quite easily.
 
S

Skybuck Flying

Ancient_Hacker said:
A couple issues with your tests:

(1) Delphi is a wonderful development system, but the compiler doesnt
do all that much optimization.

(2) If you switch to a heavitly optimizing compiler, such as Microsoft
C, the compiler might notice you never use any of the values you store,
and it will even optimize out the store loop!

(3) CPU cache size will have a huge effect on the loop speed. if you
access the array in the order the stuff is stored in memory you can get
a 4x speedup quite easily.

Why would CPU cache have effect ?

Each "offset" / "index combination" is only accessed once.

The cache could only have effect if the memory for example is divided in
pages... and the page is kept in the cpu cache..

That's something I can understand ;)

Bye,
Skybuck :)
 
T

TheBoffin

Ancient_Hacker said:
A couple issues with your tests:

(1) Delphi is a wonderful development system, but the compiler doesnt
do all that much optimization.

Surely that's a good point in terms of SkyBuck's test program?
 
K

Ken Hagan

Skybuck said:
Why would CPU cache have effect ?

Because as soon as you access the first few bytes of a cache line
(typically 32 bytes or more), the whole line is pre-loaded and
can be accessed almost instantaneously.
Each "offset" / "index combination" is only accessed once.

The cache could only have effect if the memory for example is divided in
pages... and the page is kept in the cpu cache..

Substituting "cache line" for "page", I think that's pretty much
exactly what happens on any desktop PC made in the last 15 years.
 
R

Rich Grise

Hello,

When accessing a multi dimensional array it can be done in different index
orders for example:

// slow:
for x:=0 to 199 do
for y:= 0 to 199 do
for z:= 0 to 199 do
array[z,y,x] := ...;

// 4x times faster:
for x:=0 to 199 do
for y:= 0 to 199 do
for z:= 0 to 199 do
array[x,y,z] := ...;

Test system is AMD X2 3800+ with Windows XP x64 Pro Edition, Code compiled
with Borland Delphi 2006 Win32.

I read about different row and column access times for memory chips, and
some time ago somebody mentioned communication between cpu and processor is
faster for sequential memory access...I thought it was fun to write a speed
test program to investigate this. I wrote a test program to try out all
index order combinations for four and three dimensional arrays and there is
indeed a substantial timing difference.

More information on this would be nice...

The last C compiler I bothered to check, the last subscript in [a,b,c] is
the one that "changes fastest", i.e., you can go from one to the next with
nothing but an increment. Then, increment the next one (second from the
right), loop, and so on.

If you index it to the middle (i.e., your first example) it has to reload
the pointers for every single store operation.

Hope This Helps!
Rich
 
D

David Brown

Ken said:
Because as soon as you access the first few bytes of a cache line
(typically 32 bytes or more), the whole line is pre-loaded and
can be accessed almost instantaneously.


Substituting "cache line" for "page", I think that's pretty much
exactly what happens on any desktop PC made in the last 15 years.

And if your arrays are big enough, you can include dram banks as well -
it is much more efficient to access memory in a dram bank that is
already open.
 
D

David Brown

Rich said:
Hello,

When accessing a multi dimensional array it can be done in different index
orders for example:

// slow:
for x:=0 to 199 do
for y:= 0 to 199 do
for z:= 0 to 199 do
array[z,y,x] := ...;

// 4x times faster:
for x:=0 to 199 do
for y:= 0 to 199 do
for z:= 0 to 199 do
array[x,y,z] := ...;

Test system is AMD X2 3800+ with Windows XP x64 Pro Edition, Code compiled
with Borland Delphi 2006 Win32.

I read about different row and column access times for memory chips, and
some time ago somebody mentioned communication between cpu and processor is
faster for sequential memory access...I thought it was fun to write a speed
test program to investigate this. I wrote a test program to try out all
index order combinations for four and three dimensional arrays and there is
indeed a substantial timing difference.

More information on this would be nice...

The last C compiler I bothered to check, the last subscript in [a,b,c] is
the one that "changes fastest", i.e., you can go from one to the next with
nothing but an increment. Then, increment the next one (second from the
right), loop, and so on.

If you index it to the middle (i.e., your first example) it has to reload
the pointers for every single store operation.

In C, it is *always* the last subscript that changes - it is not
dependant on the compiler. On the other hand, the code is written in
Pascal, not C, which AFAIK does not specify the ordering. On the third
hand, Delphi (the Pascal compiler in this case) does increment the last
subscript fastest.
 
Top