I currently write a program with large arrays and I am very confused about the processing time of the different arrays. On the one hand I have 7 "smaller" arrays (<=65536 elements) and on the other hand I have 7 large arrays (65536 < elements <= 67108864). Both are integer arrays. I only want to increment the various elements in a loop. Every loop I increase the elements at the same index of each array. This means every loop I want to increment all 14 array at the same index. This takes 21 seconds. If I only increment the 7 smaller arrays it takes only 2 seconds and if I only increment the 7 larger arrays it requires 18 seconds! For a better illustration I have written an example code:
#include <iostream>
#include <time.h>
using namespace std;
int main(int argc,char* argv[])
{
int i;
int* ExampleArray1 = new int [16384]; // 3 "small" arrays (2^14)
int* ExampleArray2 = new int [32768]; // (2^15)
int* ExampleArray3 = new int [65536]; // (2^16)
for(i=0 ; i<16384 ; i++) {ExampleArray1[i] = 0;}
for(i=0 ; i<32768 ; i++) {ExampleArray2[i] = 0;}
for(i=0 ; i<65536 ; i++) {ExampleArray3[i] = 0;}
int* ExampleArray4 = new int [16777216]; // 3 large arrays (2^24)
int* ExampleArray5 = new int [33554432]; // (2^25)
int* ExampleArray6 = new int [67108864]; // (2^26)
for(i=0 ; i<16777216 ; i++) {ExampleArray4[i] = 0;}
for(i=0 ; i<33554432 ; i++) {ExampleArray5[i] = 0;}
for(i=0 ; i<67108864 ; i++) {ExampleArray6[i] = 0;}
int until;
clock_t start,stop;
cout << "Until: "; cin >> until;
start = clock();
for(i=0 ; i<until; i++)
{
ExampleArray1[1]++;
ExampleArray2[1]++;
ExampleArray3[1]++;
}
stop = clock();
cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.\n";
start = clock();
for(i=0 ; i<until; i++)
{
ExampleArray4[1]++;
ExampleArray5[1]++;
ExampleArray6[1]++;
}
stop = clock();
cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.\n";
delete[] ExampleArray1; ExampleArray1 = NULL;
delete[] ExampleArray2; ExampleArray2 = NULL;
delete[] ExampleArray3; ExampleArray3 = NULL;
delete[] ExampleArray4; ExampleArray4 = NULL;
delete[] ExampleArray5; ExampleArray5 = NULL;
delete[] ExampleArray6; ExampleArray6 = NULL;
getchar();
return 0;
}
Another confusing point is the time difference between the compiling modes. As "until" I entered 1 billion.
Output -> Debug-Mode (standard):
Time: 6 sec.
Time: 26 sec.
Output -> Release-Mode (Fully Optimization enabled[/Ox] (I believe this is standard)):
Time: 34 sec.
Time: 47 sec.
Output -> Release-Mode (Optimization in property sheet disabled[/Od]):
Time: 25 sec.
Time: 25 sec.
But if I change the code a little bit and replace the increment by an assignment the output is a bit different:
[...]
int until;
int value;
clock_t start,stop;
cout << "Until: "; cin >> until;
cout << "Value: "; cin >> value;
start = clock();
for(i=0 ; i<until; i++)
{
ExampleArray1[1]=value;
ExampleArray2[1]=value;
ExampleArray3[1]=value;
}
stop = clock();
cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.\n";
start = clock();
for(i=0 ; i<until; i++)
{
ExampleArray4[1]=value;
ExampleArray5[1]=value;
ExampleArray6[1]=value;
}
stop = clock();
cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.\n";
[...]
Output -> Debug-Mode (standard):
Time: 4 sec.
Time: 28 sec.
Output -> Release-Mode (Fully Opitmization enabled[/Ox]):
Time: 38 sec.
Time: 38 sec.
Output -> Release-Mode (Optimization disabled[/Od]):
Time: 24 sec.
Time: 28 sec.
I thought that the access time of an array is always the same. I'm using Visual Studio 2012 32-Bit Express and Windows 7 64-Bit. I have everything tested several times. The times were approximately the same (but up to +/- 6 seconds) and I have taken a rounded average. My questions are:
Thanks in advance.
Edit:
The disassembly in debug mode is in each case (large and small arrays) the same. Array1 for example and the header of the first loop:
for (i = 0; i<until; i++)
01275FEA mov dword ptr [i],0
01275FF1 jmp main+21Ch (01275FFCh)
01275FF3 mov eax,dword ptr [i]
01275FF6 add eax,1
01275FF9 mov dword ptr [i],eax
01275FFC mov eax,dword ptr [i]
01275FFF cmp eax,dword ptr [until]
01276002 jge main+283h (01276063h)
{
ExampleArray1[1]++;
01276004 mov eax,4
01276009 shl eax,0
0127600C mov ecx,dword ptr [ExampleArray1]
0127600F mov edx,dword ptr [ecx+eax]
01276012 add edx,1
01276015 mov eax,4
0127601A shl eax,0
0127601D mov ecx,dword ptr [ExampleArray1]
01276020 mov dword ptr [ecx+eax],edx
In release mode (with optimization) the code of the first loop is shorter than the code of the second and the second header is twice in the disassembly (could this be the culprit ?). First loop:
for(i=0 ; i<until; i++)
00AD10D9 xor ecx,ecx
00AD10DB mov dword ptr [esp+24h],eax
00AD10DF cmp dword ptr [esp+10h],ecx
00AD10E3 jle main+0F5h (0AD10F5h)
{
ExampleArray1[1]++;
00AD10E5 inc dword ptr [esi+4]
ExampleArray2[1]++;
00AD10E8 inc dword ptr [ebx+4]
ExampleArray3[1]++;
00AD10EB inc dword ptr [edi+4]
00AD10EE inc ecx
00AD10EF cmp ecx,dword ptr [esp+10h]
00AD10F3 jl main+0E5h (0AD10E5h)
Second loop:
for(i=0 ; i<until; i++)
003B13B4 xor ecx,ecx
003B13B6 mov dword ptr [esp+24h],eax
003B13BA cmp dword ptr [esp+10h],ecx
003B13BE jle main+174h (03B13E4h)
for(i=0 ; i<until; i++)
003B13C0 mov eax,dword ptr [esp+1Ch]
003B13C4 mov edx,dword ptr [esp+18h]
003B13C8 mov edi,dword ptr [esp+20h]
003B13CC lea esp,[esp]
{
ExampleArray4[1]++;
003B13D0 inc dword ptr [eax+4]
ExampleArray5[1]++;
003B13D3 inc dword ptr [edx+4]
ExampleArray6[1]++;
003B13D6 inc dword ptr [edi+4]
003B13D9 inc ecx
003B13DA cmp ecx,dword ptr [esp+10h]
003B13DE jl main+160h (03B13D0h)
003B13E0 mov edi,dword ptr [esp+14h]
}
Thats the disassembly of release without opitmization -> First loop:
for(i=0 ; i<until; i++)
001A17A2 mov dword ptr [i],0
001A17A9 jmp main+1C4h (01A17B4h)
001A17AB mov edx,dword ptr [i]
001A17AE add edx,1
001A17B1 mov dword ptr [i],edx
001A17B4 mov eax,dword ptr [i]
001A17B7 cmp eax,dword ptr [until]
001A17BA jge main+22Bh (01A181Bh)
{
ExampleArray1[1]++;
001A17BC mov ecx,4
001A17C1 shl ecx,0
001A17C4 mov edx,dword ptr [ExampleArray1]
001A17C7 mov eax,dword ptr [edx+ecx]
001A17CA add eax,1
001A17CD mov ecx,4
001A17D2 shl ecx,0
001A17D5 mov edx,dword ptr [ExampleArray1]
001A17D8 mov dword ptr [edx+ecx],eax
ExampleArray2[1]++;
001A17DB mov eax,4
001A17E0 shl eax,0
001A17E3 mov ecx,dword ptr [ExampleArray2]
001A17E6 mov edx,dword ptr [ecx+eax]
001A17E9 add edx,1
001A17EC mov eax,4
001A17F1 shl eax,0
001A17F4 mov ecx,dword ptr [ExampleArray2]
001A17F7 mov dword ptr [ecx+eax],edx
ExampleArray3[1]++;
001A17FA mov edx,4
001A17FF shl edx,0
001A1802 mov eax,dword ptr [ExampleArray3]
001A1805 mov ecx,dword ptr [eax+edx]
001A1808 add ecx,1
001A180B mov edx,4
001A1810 shl edx,0
001A1813 mov eax,dword ptr [ExampleArray3]
001A1816 mov dword ptr [eax+edx],ecx
Second loop:
for(i=0 ; i<until; i++)
001A186F mov dword ptr [i],0
001A1876 jmp main+291h (01A1881h)
001A1878 mov eax,dword ptr [i]
001A187B add eax,1
001A187E mov dword ptr [i],eax
001A1881 mov ecx,dword ptr [i]
001A1884 cmp ecx,dword ptr [until]
001A1887 jge main+2F8h (01A18E8h)
{
ExampleArray4[1]++;
001A1889 mov edx,4
001A188E shl edx,0
001A1891 mov eax,dword ptr [ExampleArray4]
001A1894 mov ecx,dword ptr [eax+edx]
001A1897 add ecx,1
001A189A mov edx,4
001A189F shl edx,0
001A18A2 mov eax,dword ptr [ExampleArray4]
{
ExampleArray4[1]++;
001A18A5 mov dword ptr [eax+edx],ecx
ExampleArray5[1]++;
001A18A8 mov ecx,4
001A18AD shl ecx,0
001A18B0 mov edx,dword ptr [ExampleArray5]
001A18B3 mov eax,dword ptr [edx+ecx]
001A18B6 add eax,1
001A18B9 mov ecx,4
001A18BE shl ecx,0
001A18C1 mov edx,dword ptr [ExampleArray5]
001A18C4 mov dword ptr [edx+ecx],eax
ExampleArray6[1]++;
001A18C7 mov eax,4
001A18CC shl eax,0
001A18CF mov ecx,dword ptr [ExampleArray6]
001A18D2 mov edx,dword ptr [ecx+eax]
001A18D5 add edx,1
001A18D8 mov eax,4
001A18DD shl eax,0
001A18E0 mov ecx,dword ptr [ExampleArray6]
001A18E3 mov dword ptr [ecx+eax],edx
}
Debug mode: -same as in VS 2012-
In release mode (with optimization) it looks a bit different -> First loop:
for (i = 0; i<until; i++)
01391384 xor ecx,ecx
01391386 mov dword ptr [esp+10h],eax
0139138A cmp dword ptr [esp+20h],ecx
0139138E jle main+100h (013913A0h)
{
ExampleArray1[1]++;
01391390 inc dword ptr [esi+4]
01391393 inc ecx
ExampleArray2[1]++;
01391394 inc dword ptr [ebx+4]
ExampleArray3[1]++;
01391397 inc dword ptr [edi+4]
0139139A cmp ecx,dword ptr [esp+20h]
0139139E jl main+0F0h (01391390h)
}
Second loop (loop header is larger ? ):
for (i = 0; i<until; i++)
013913EF xor ecx,ecx
013913F1 mov dword ptr [esp+10h],eax
013913F5 cmp dword ptr [esp+20h],ecx
013913F9 jle main+17Bh (0139141Bh)
013913FB mov eax,dword ptr [esp+18h]
013913FF mov edx,dword ptr [esp+0Ch]
01391403 mov edi,dword ptr [esp+1Ch]
{
ExampleArray4[1]++;
01391407 inc dword ptr [eax+4]
0139140A inc ecx
ExampleArray5[1]++;
0139140B inc dword ptr [edx+4]
ExampleArray6[1]++;
0139140E inc dword ptr [edi+4]
01391411 cmp ecx,dword ptr [esp+20h]
01391415 jl main+167h (01391407h)
01391417 mov edi,dword ptr [esp+14h]
}
Release mode (without optimization) -> First loop:
for (i = 0; i<until; i++)
00DC182C mov dword ptr [i],0
00DC1833 jmp main+1CEh (0DC183Eh)
00DC1835 mov edx,dword ptr [i]
00DC1838 add edx,1
00DC183B mov dword ptr [i],edx
00DC183E mov eax,dword ptr [i]
00DC1841 cmp eax,dword ptr [until]
00DC1844 jge main+235h (0DC18A5h)
{
ExampleArray1[1]++;
00DC1846 mov ecx,4
00DC184B shl ecx,0
00DC184E mov edx,dword ptr [ExampleArray1]
00DC1851 mov eax,dword ptr [edx+ecx]
00DC1854 add eax,1
00DC1857 mov ecx,4
00DC185C shl ecx,0
00DC185F mov edx,dword ptr [ExampleArray1]
00DC1862 mov dword ptr [edx+ecx],eax
ExampleArray2[1]++;
00DC1865 mov eax,4
00DC186A shl eax,0
00DC186D mov ecx,dword ptr [ExampleArray2]
00DC1870 mov edx,dword ptr [ecx+eax]
00DC1873 add edx,1
00DC1876 mov eax,4
00DC187B shl eax,0
00DC187E mov ecx,dword ptr [ExampleArray2]
00DC1881 mov dword ptr [ecx+eax],edx
ExampleArray3[1]++;
00DC1884 mov edx,4
00DC1889 shl edx,0
00DC188C mov eax,dword ptr [ExampleArray3]
00DC188F mov ecx,dword ptr [eax+edx]
00DC1892 add ecx,1
00DC1895 mov edx,4
00DC189A shl edx,0
00DC189D mov eax,dword ptr [ExampleArray3]
00DC18A0 mov dword ptr [eax+edx],ecx
}
Second loop:
for (i = 0; i<until; i++)
00DC18F9 mov dword ptr [i],0
00DC1900 jmp main+29Bh (0DC190Bh)
00DC1902 mov eax,dword ptr [i]
00DC1905 add eax,1
00DC1908 mov dword ptr [i],eax
00DC190B mov ecx,dword ptr [i]
00DC190E cmp ecx,dword ptr [until]
00DC1911 jge main+302h (0DC1972h)
{
ExampleArray4[1]++;
00DC1913 mov edx,4
{
ExampleArray4[1]++;
00DC1918 shl edx,0
00DC191B mov eax,dword ptr [ExampleArray4]
00DC191E mov ecx,dword ptr [eax+edx]
00DC1921 add ecx,1
00DC1924 mov edx,4
00DC1929 shl edx,0
00DC192C mov eax,dword ptr [ExampleArray4]
00DC192F mov dword ptr [eax+edx],ecx
ExampleArray5[1]++;
00DC1932 mov ecx,4
00DC1937 shl ecx,0
00DC193A mov edx,dword ptr [ExampleArray5]
00DC193D mov eax,dword ptr [edx+ecx]
00DC1940 add eax,1
00DC1943 mov ecx,4
00DC1948 shl ecx,0
00DC194B mov edx,dword ptr [ExampleArray5]
00DC194E mov dword ptr [edx+ecx],eax
ExampleArray6[1]++;
00DC1951 mov eax,4
00DC1956 shl eax,0
00DC1959 mov ecx,dword ptr [ExampleArray6]
00DC195C mov edx,dword ptr [ecx+eax]
00DC195F add edx,1
00DC1962 mov eax,4
00DC1967 shl eax,0
00DC196A mov ecx,dword ptr [ExampleArray6]
00DC196D mov dword ptr [ecx+eax],edx
}
Edit: Well, I've done what I can. I'm more confused than before. To see the disassembly I have set a breakpoint in my code and pressed right mousbutton->Go to disassembly. Then I have seen the disassembly. In all this codes I have set the breakpoint at "ExampleArray2[1]++" (line 34). Now, I have noticed that the disassembly changes if I set the breakpoint at another array. Why ... is Visual Studio broken or something like this or is this normal? This Question is already really big. I have uploaded the disassembly of the clocks and the hard coded until at pastebin: http://pastebin.com/pxgegzZ6
IMPORTANT: I have made a hopefully helpful discovery in release mode. If I press the right mouse button in the second loop and press "run until cursor is reached" or set the cursor in the second loop and press Ctrl+F10 than the first loop only require 2.6 seconds ! If I set the cursor at the end of the code the first loop need 2.5-2.7 seconds but the second is as slow as before. In debug mode the times don't change.
This looks to me like a cache issue.
If, instead of
for(i=0 ; i<until; i++)
{
ExampleArray4[1]++;
ExampleArray5[1]++;
ExampleArray6[1]++;
}
you did
for(i=0 ; i<until; i++) ExampleArray4[1]++;
for(i=0 ; i<until; i++) ExampleArray5[1]++;
for(i=0 ; i<until; i++) ExampleArray6[1]++;
I predict it would be much faster.
The reason the cache is an issue is because every time you switch from working on one array to the next, a different location must be drawn from cache. By switching array so often you have a lot of cache misses. A similar issue occurs if you walk across rows of a 2d array rather than columns. Presumably, this doesn't affect your smaller arrays because they can all fit in cache.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With