Increase of access time with large array indexes

Question

Problem

I currently write a program with large arrays and I am very confused about the processing time of the different arrays. On the one hand I have 7 "smaller" arrays (<=65536 elements) and on the other hand I have 7 large arrays (65536 < elements <= 67108864). Both are integer arrays. I only want to increment the various elements in a loop. Every loop I increase the elements at the same index of each array. This means every loop I want to increment all 14 array at the same index. This takes 21 seconds. If I only increment the 7 smaller arrays it takes only 2 seconds and if I only increment the 7 larger arrays it requires 18 seconds! For a better illustration I have written an example code:

#include <iostream>
#include <time.h>

using namespace std;

int main(int argc,char* argv[])
{
    int i;

    int* ExampleArray1 = new int [16384];       // 3 "small" arrays (2^14)
    int* ExampleArray2 = new int [32768];       //                  (2^15)
    int* ExampleArray3 = new int [65536];       //                  (2^16)
    for(i=0 ; i<16384 ; i++) {ExampleArray1[i] = 0;}
    for(i=0 ; i<32768 ; i++) {ExampleArray2[i] = 0;}
    for(i=0 ; i<65536 ; i++) {ExampleArray3[i] = 0;}

    int* ExampleArray4 = new int [16777216];    // 3 large arrays   (2^24)
    int* ExampleArray5 = new int [33554432];    //                  (2^25)
    int* ExampleArray6 = new int [67108864];    //                  (2^26)
    for(i=0 ; i<16777216 ; i++) {ExampleArray4[i] = 0;}
    for(i=0 ; i<33554432 ; i++) {ExampleArray5[i] = 0;}
    for(i=0 ; i<67108864 ; i++) {ExampleArray6[i] = 0;}

    int until;

    clock_t start,stop;
    cout << "Until: "; cin >> until;

    start = clock();
    for(i=0 ; i<until; i++)
    {
        ExampleArray1[1]++;
        ExampleArray2[1]++;
        ExampleArray3[1]++;
    }
    stop = clock();
    cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.
";

    start = clock();
    for(i=0 ; i<until; i++)
    {
        ExampleArray4[1]++;
        ExampleArray5[1]++;
        ExampleArray6[1]++;
    }
    stop = clock();
    cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.
";

    delete[] ExampleArray1; ExampleArray1 = NULL;
    delete[] ExampleArray2; ExampleArray2 = NULL;
    delete[] ExampleArray3; ExampleArray3 = NULL;
    delete[] ExampleArray4; ExampleArray4 = NULL;
    delete[] ExampleArray5; ExampleArray5 = NULL;
    delete[] ExampleArray6; ExampleArray6 = NULL;

    getchar();
    return 0;
}

Another confusing point is the time difference between the compiling modes. As "until" I entered 1 billion.

Output -> Debug-Mode (standard):

Time: 6 sec.
Time: 26 sec.

Output -> Release-Mode (Fully Optimization enabled[/Ox] (I believe this is standard)):

Time: 34 sec.
Time: 47 sec.

Output -> Release-Mode (Optimization in property sheet disabled[/Od]):

Time: 25 sec.
Time: 25 sec.

But if I change the code a little bit and replace the increment by an assignment the output is a bit different:

[...]
int until;
int value;

clock_t start,stop;
cout << "Until: "; cin >> until;
cout << "Value: "; cin >> value;

start = clock();
for(i=0 ; i<until; i++)
{
    ExampleArray1[1]=value;
    ExampleArray2[1]=value;
    ExampleArray3[1]=value;
}
stop = clock();
cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.
";

start = clock();
for(i=0 ; i<until; i++)
{
    ExampleArray4[1]=value;
    ExampleArray5[1]=value;
    ExampleArray6[1]=value;
}
stop = clock();
cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.
";
[...]

Output -> Debug-Mode (standard):

Time: 4 sec.
Time: 28 sec.

Output -> Release-Mode (Fully Opitmization enabled[/Ox]):

Time: 38 sec.
Time: 38 sec.

Output -> Release-Mode (Optimization disabled[/Od]):

Time: 24 sec.
Time: 28 sec.

I thought that the access time of an array is always the same. I'm using Visual Studio 2012 32-Bit Express and Windows 7 64-Bit. I have everything tested several times. The times were approximately the same (but up to +/- 6 seconds) and I have taken a rounded average. My questions are:

Questions

Why the times differs so much with the various array sizes ? -> Is this avoidable without splitting the large arrays ?
Why the debug mode requires less time than the release mode ?
What can be the reasons that the release mode without optimization is faster than with optimization ?

Thanks in advance.

Edit:

System data:

Prozessor: AMD Phenom II X6 1045T (6 x 2,7GHz)
L1 Cache: 768 KB (2x6x64 KB) 2-way | L2 Cache: 3072 KB (6x512 KB) 16-way | L3 Cache: 6 MB 48-way associative
RAM: 2x 4GB DDR3
SSD hard drive

Disassembly (of the first code with incrementation)

Disassembly - VS 2012 (Express)

The disassembly in debug mode is in each case (large and small arrays) the same. Array1 for example and the header of the first loop:

    for (i = 0; i<until; i++)
01275FEA  mov         dword ptr [i],0  
01275FF1  jmp         main+21Ch (01275FFCh)  
01275FF3  mov         eax,dword ptr [i]  
01275FF6  add         eax,1  
01275FF9  mov         dword ptr [i],eax  
01275FFC  mov         eax,dword ptr [i]  
01275FFF  cmp         eax,dword ptr [until]  
01276002  jge         main+283h (01276063h)  
    {
        ExampleArray1[1]++;
01276004  mov         eax,4  
01276009  shl         eax,0  
0127600C  mov         ecx,dword ptr [ExampleArray1]  
0127600F  mov         edx,dword ptr [ecx+eax]  
01276012  add         edx,1  
01276015  mov         eax,4  
0127601A  shl         eax,0  
0127601D  mov         ecx,dword ptr [ExampleArray1]  
01276020  mov         dword ptr [ecx+eax],edx

In release mode (with optimization) the code of the first loop is shorter than the code of the second and the second header is twice in the disassembly (could this be the culprit ?). First loop:

    for(i=0 ; i<until; i++)
00AD10D9  xor         ecx,ecx  
00AD10DB  mov         dword ptr [esp+24h],eax  
00AD10DF  cmp         dword ptr [esp+10h],ecx  
00AD10E3  jle         main+0F5h (0AD10F5h)  
    {
        ExampleArray1[1]++;
00AD10E5  inc         dword ptr [esi+4]  
        ExampleArray2[1]++;
00AD10E8  inc         dword ptr [ebx+4]  
        ExampleArray3[1]++;
00AD10EB  inc         dword ptr [edi+4]  
00AD10EE  inc         ecx  
00AD10EF  cmp         ecx,dword ptr [esp+10h]  
00AD10F3  jl          main+0E5h (0AD10E5h)

Second loop:

        for(i=0 ; i<until; i++)
003B13B4  xor         ecx,ecx  
003B13B6  mov         dword ptr [esp+24h],eax  
003B13BA  cmp         dword ptr [esp+10h],ecx  
003B13BE  jle         main+174h (03B13E4h)  
        for(i=0 ; i<until; i++)
003B13C0  mov         eax,dword ptr [esp+1Ch]  
003B13C4  mov         edx,dword ptr [esp+18h]  
003B13C8  mov         edi,dword ptr [esp+20h]  
003B13CC  lea         esp,[esp]  
    {
        ExampleArray4[1]++;
003B13D0  inc         dword ptr [eax+4]  
        ExampleArray5[1]++;
003B13D3  inc         dword ptr [edx+4]  
        ExampleArray6[1]++;
003B13D6  inc         dword ptr [edi+4]  
003B13D9  inc         ecx  
003B13DA  cmp         ecx,dword ptr [esp+10h]  
003B13DE  jl          main+160h (03B13D0h)  
003B13E0  mov         edi,dword ptr [esp+14h]  
    }

Thats the disassembly of release without opitmization -> First loop:

    for(i=0 ; i<until; i++)
001A17A2  mov         dword ptr [i],0  
001A17A9  jmp         main+1C4h (01A17B4h)  
001A17AB  mov         edx,dword ptr [i]  
001A17AE  add         edx,1  
001A17B1  mov         dword ptr [i],edx  
001A17B4  mov         eax,dword ptr [i]  
001A17B7  cmp         eax,dword ptr [until]  
001A17BA  jge         main+22Bh (01A181Bh)  
    {
        ExampleArray1[1]++;
001A17BC  mov         ecx,4  
001A17C1  shl         ecx,0  
001A17C4  mov         edx,dword ptr [ExampleArray1]  
001A17C7  mov         eax,dword ptr [edx+ecx]  
001A17CA  add         eax,1  
001A17CD  mov         ecx,4  
001A17D2  shl         ecx,0  
001A17D5  mov         edx,dword ptr [ExampleArray1]  
001A17D8  mov         dword ptr [edx+ecx],eax  
        ExampleArray2[1]++;
001A17DB  mov         eax,4  
001A17E0  shl         eax,0  
001A17E3  mov         ecx,dword ptr [ExampleArray2]  
001A17E6  mov         edx,dword ptr [ecx+eax]  
001A17E9  add         edx,1  
001A17EC  mov         eax,4  
001A17F1  shl         eax,0  
001A17F4  mov         ecx,dword ptr [ExampleArray2]  
001A17F7  mov         dword ptr [ecx+eax],edx  
        ExampleArray3[1]++;
001A17FA  mov         edx,4  
001A17FF  shl         edx,0  
001A1802  mov         eax,dword ptr [ExampleArray3]  
001A1805  mov         ecx,dword ptr [eax+edx]  
001A1808  add         ecx,1  
001A180B  mov         edx,4  
001A1810  shl         edx,0  
001A1813  mov         eax,dword ptr [ExampleArray3]  
001A1816  mov         dword ptr [eax+edx],ecx

Second loop:

    for(i=0 ; i<until; i++)
001A186F  mov         dword ptr [i],0  
001A1876  jmp         main+291h (01A1881h)  
001A1878  mov         eax,dword ptr [i]  
001A187B  add         eax,1  
001A187E  mov         dword ptr [i],eax  
001A1881  mov         ecx,dword ptr [i]  
001A1884  cmp         ecx,dword ptr [until]  
001A1887  jge         main+2F8h (01A18E8h)  
    {
        ExampleArray4[1]++;
001A1889  mov         edx,4  
001A188E  shl         edx,0  
001A1891  mov         eax,dword ptr [ExampleArray4]  
001A1894  mov         ecx,dword ptr [eax+edx]  
001A1897  add         ecx,1  
001A189A  mov         edx,4  
001A189F  shl         edx,0  
001A18A2  mov         eax,dword ptr [ExampleArray4]  
    {
        ExampleArray4[1]++;
001A18A5  mov         dword ptr [eax+edx],ecx  
        ExampleArray5[1]++;
001A18A8  mov         ecx,4  
001A18AD  shl         ecx,0  
001A18B0  mov         edx,dword ptr [ExampleArray5]  
001A18B3  mov         eax,dword ptr [edx+ecx]  
001A18B6  add         eax,1  
001A18B9  mov         ecx,4  
001A18BE  shl         ecx,0  
001A18C1  mov         edx,dword ptr [ExampleArray5]  
001A18C4  mov         dword ptr [edx+ecx],eax  
        ExampleArray6[1]++;
001A18C7  mov         eax,4  
001A18CC  shl         eax,0  
001A18CF  mov         ecx,dword ptr [ExampleArray6]  
001A18D2  mov         edx,dword ptr [ecx+eax]  
001A18D5  add         edx,1  
001A18D8  mov         eax,4  
001A18DD  shl         eax,0  
001A18E0  mov         ecx,dword ptr [ExampleArray6]  
001A18E3  mov         dword ptr [ecx+eax],edx  
    }

Disassembly - VS 2013 (Express)

Debug mode: -same as in VS 2012-

In release mode (with optimization) it looks a bit different -> First loop:

    for (i = 0; i<until; i++)
01391384  xor         ecx,ecx  
01391386  mov         dword ptr [esp+10h],eax  
0139138A  cmp         dword ptr [esp+20h],ecx  
0139138E  jle         main+100h (013913A0h)  
    {
        ExampleArray1[1]++;
01391390  inc         dword ptr [esi+4]  
01391393  inc         ecx  
        ExampleArray2[1]++;
01391394  inc         dword ptr [ebx+4]  
        ExampleArray3[1]++;
01391397  inc         dword ptr [edi+4]  
0139139A  cmp         ecx,dword ptr [esp+20h]  
0139139E  jl          main+0F0h (01391390h)  
    }

Second loop (loop header is larger ? ):

    for (i = 0; i<until; i++)
013913EF  xor         ecx,ecx  
013913F1  mov         dword ptr [esp+10h],eax  
013913F5  cmp         dword ptr [esp+20h],ecx  
013913F9  jle         main+17Bh (0139141Bh)  
013913FB  mov         eax,dword ptr [esp+18h]  
013913FF  mov         edx,dword ptr [esp+0Ch]  
01391403  mov         edi,dword ptr [esp+1Ch]  
    {
        ExampleArray4[1]++;
01391407  inc         dword ptr [eax+4]  
0139140A  inc         ecx  
        ExampleArray5[1]++;
0139140B  inc         dword ptr [edx+4]  
        ExampleArray6[1]++;
0139140E  inc         dword ptr [edi+4]  
01391411  cmp         ecx,dword ptr [esp+20h]  
01391415  jl          main+167h (01391407h)  
01391417  mov         edi,dword ptr [esp+14h]  
    }

Release mode (without optimization) -> First loop:

    for (i = 0; i<until; i++)
00DC182C  mov         dword ptr [i],0  
00DC1833  jmp         main+1CEh (0DC183Eh)  
00DC1835  mov         edx,dword ptr [i]  
00DC1838  add         edx,1  
00DC183B  mov         dword ptr [i],edx  
00DC183E  mov         eax,dword ptr [i]  
00DC1841  cmp         eax,dword ptr [until]  
00DC1844  jge         main+235h (0DC18A5h)  
    {
        ExampleArray1[1]++;
00DC1846  mov         ecx,4  
00DC184B  shl         ecx,0  
00DC184E  mov         edx,dword ptr [ExampleArray1]  
00DC1851  mov         eax,dword ptr [edx+ecx]  
00DC1854  add         eax,1  
00DC1857  mov         ecx,4  
00DC185C  shl         ecx,0  
00DC185F  mov         edx,dword ptr [ExampleArray1]  
00DC1862  mov         dword ptr [edx+ecx],eax  
        ExampleArray2[1]++;
00DC1865  mov         eax,4  
00DC186A  shl         eax,0  
00DC186D  mov         ecx,dword ptr [ExampleArray2]  
00DC1870  mov         edx,dword ptr [ecx+eax]  
00DC1873  add         edx,1  
00DC1876  mov         eax,4  
00DC187B  shl         eax,0  
00DC187E  mov         ecx,dword ptr [ExampleArray2]  
00DC1881  mov         dword ptr [ecx+eax],edx  
        ExampleArray3[1]++;
00DC1884  mov         edx,4  
00DC1889  shl         edx,0  
00DC188C  mov         eax,dword ptr [ExampleArray3]  
00DC188F  mov         ecx,dword ptr [eax+edx]  
00DC1892  add         ecx,1  
00DC1895  mov         edx,4  
00DC189A  shl         edx,0  
00DC189D  mov         eax,dword ptr [ExampleArray3]  
00DC18A0  mov         dword ptr [eax+edx],ecx  
        }

Second loop:

    for (i = 0; i<until; i++)
00DC18F9  mov         dword ptr [i],0  
00DC1900  jmp         main+29Bh (0DC190Bh)  
00DC1902  mov         eax,dword ptr [i]  
00DC1905  add         eax,1  
00DC1908  mov         dword ptr [i],eax  
00DC190B  mov         ecx,dword ptr [i]  
00DC190E  cmp         ecx,dword ptr [until]  
00DC1911  jge         main+302h (0DC1972h)  
    {
        ExampleArray4[1]++;
 00DC1913  mov         edx,4  
    {
        ExampleArray4[1]++;
00DC1918  shl         edx,0  
00DC191B  mov         eax,dword ptr [ExampleArray4]  
00DC191E  mov         ecx,dword ptr [eax+edx]  
00DC1921  add         ecx,1  
00DC1924  mov         edx,4  
00DC1929  shl         edx,0  
00DC192C  mov         eax,dword ptr [ExampleArray4]  
00DC192F  mov         dword ptr [eax+edx],ecx  
        ExampleArray5[1]++;
00DC1932  mov         ecx,4  
00DC1937  shl         ecx,0  
00DC193A  mov         edx,dword ptr [ExampleArray5]  
00DC193D  mov         eax,dword ptr [edx+ecx]  
00DC1940  add         eax,1  
00DC1943  mov         ecx,4  
00DC1948  shl         ecx,0  
00DC194B  mov         edx,dword ptr [ExampleArray5]  
00DC194E  mov         dword ptr [edx+ecx],eax  
        ExampleArray6[1]++;
00DC1951  mov         eax,4  
00DC1956  shl         eax,0  
00DC1959  mov         ecx,dword ptr [ExampleArray6]  
00DC195C  mov         edx,dword ptr [ecx+eax]  
00DC195F  add         edx,1  
00DC1962  mov         eax,4  
00DC1967  shl         eax,0  
00DC196A  mov         ecx,dword ptr [ExampleArray6]  
00DC196D  mov         dword ptr [ecx+eax],edx  
    }

Edit: Well, I've done what I can. I'm more confused than before. To see the disassembly I have set a breakpoint in my code and pressed right mousbutton->Go to disassembly. Then I have seen the disassembly. In all this codes I have set the breakpoint at "ExampleArray2[1]++" (line 34). Now, I have noticed that the disassembly changes if I set the breakpoint at another array. Why ... is Visual Studio broken or something like this or is this normal? This Question is already really big. I have uploaded the disassembly of the clocks and the hard coded until at pastebin: http://pastebin.com/pxgegzZ6

IMPORTANT: I have made a hopefully helpful discovery in release mode. If I press the right mouse button in the second loop and press "run until cursor is reached" or set the cursor in the second loop and press Ctrl+F10 than the first loop only require 2.6 seconds ! If I set the cursor at the end of the code the first loop need 2.5-2.7 seconds but the second is as slow as before. In debug mode the times don't change.

Nick Ayres · Accepted Answer

This looks to me like a cache issue.

If, instead of

for(i=0 ; i<until; i++)
{
    ExampleArray4[1]++;
    ExampleArray5[1]++;
    ExampleArray6[1]++;
}

you did

for(i=0 ; i<until; i++) ExampleArray4[1]++;
for(i=0 ; i<until; i++) ExampleArray5[1]++;
for(i=0 ; i<until; i++) ExampleArray6[1]++;

I predict it would be much faster.

The reason the cache is an issue is because every time you switch from working on one array to the next, a different location must be drawn from cache. By switching array so often you have a lot of cache misses. A similar issue occurs if you walk across rows of a 2d array rather than columns. Presumably, this doesn't affect your smaller arrays because they can all fit in cache.

Increase of access time with large array indexes

Tags:

c++

arrays

assembly

visual-studio-2012

disassembly

Problem

Questions

System data:

Disassembly (of the first code with incrementation)

Disassembly - VS 2012 (Express)

Disassembly - VS 2013 (Express)

Smith

1 Answers

Nick Ayres

Recent Activity

Donate For Us

Increase of access time with large array indexes

Tags:

c++

arrays

assembly

visual-studio-2012

disassembly

Problem

Questions

System data:

Disassembly (of the first code with incrementation)

Disassembly - VS 2012 (Express)

Disassembly - VS 2013 (Express)

Smith

1 Answers

Nick Ayres

Related questions

Recent Activity

Donate For Us