Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why are these computation speeds different for multi-dimensional array in C++? [duplicate]

This might be a repeat question. So, please feel free to flag it down if you want. In C++, I learnt that array dimensions are stored consecutively in memory How are 3D arrays stored in C? so i did a little experiment to assign natural numbers to a matrix of size 1600000000x1 and 1x1600000000 (please change matsize in the code to smaller value depending on your memory). The code below assigns a natural numbers from 1 to 1600000000 to the matrix a (whose dimensions are 1x1600000000) and computes sum of cubes of all the elements. The opposite case is just reversing the dimensions of matrix which i do by changing xdim to matsize and ydim to 1, and recompiling the code and running it again. The matrix is [xdim][ydim]

#include <iostream>
#include <time.h>
using namespace std;
int main()
{
long int matsize, i, j, xdim, ydim;
long double ss;
double** a;
double time1, time2, time3;
clock_t starttime = clock();    
matsize=1600000000;
xdim=1;
ydim=matsize;   
ss=0.0;    
a= new double *[xdim];
for(i=0;i<xdim;i++)
{
    a[i]= new double[ydim];
}

time1= (double)( clock() - starttime ) / (double)CLOCKS_PER_SEC;    
cout << "allocated. time taken for allocation was " << time1 <<" seconds. computation started" << endl;    
for(i=0;i<xdim;i++)
{
    for(j=0;j<ydim;j++)
    {
        a[i][j]=(i+1)*(j+1);
        ss=ss+a[i][j]*a[i][j]*a[i][j];
    }
}
cout << "last number is " << a[xdim-1][ydim-1] << " . sum is " << ss << endl;
time2= ((double)( clock() - starttime ) / (double)CLOCKS_PER_SEC) - time1;
cout << "computation done. time taken for computation was " << time2 << " seconds" << endl;    
for(i=0;i<xdim;i++)
{
    delete [] a[i];
}
delete [] a;   
time3= ((double)( clock() - starttime ) / (double)CLOCKS_PER_SEC) - time2;
cout << "deallocated. time taken for deallocation was " << time3 << " seconds" << endl;    
cout << "the total time taken is " << (double)( clock() - starttime ) / (double)CLOCKS_PER_SEC << endl;
cout << "or " << time1+time2+time3 << " seconds" << endl; 
return 0;
}

My results for the two cases are -

Case 1: xdim=1 and ydim=1600000000

allocated. time taken for allocation was 4.5e-05 seconds. computation started last number is 1.6e+09 . sum is 1.6384e+36 computation done. time taken for computation was 14.7475 seconds deallocated. time taken for deallocation was 0.875754 seconds the total time taken is 15.6233 or 15.6233 seconds

Case 2: xdim=1600000000 and ydim=1

allocated. time taken for allocation was 56.1583 seconds. computation started last number is 1.6e+09 . sum is 1.6384e+36 computation done. time taken for computation was 50.7347 seconds deallocated. time taken for deallocation was 270.038 seconds the total time taken is 320.773 or 376.931 seconds

The output sum is same in both cases. I can understand that time taken for allocation and deallocation of memory is different in both cases, but why is the computation time so much different too if memory allocation is continuous ? What is wrong in this code ?

If it matters, i am using g++ on Mountain Lion and compile using g++ -std=c++11, i7 quad core processor, 16 GB RAM

like image 695
Guddu Avatar asked Feb 16 '23 07:02

Guddu


2 Answers

Each individual vector stores content contiguously, including the vector of pointers to vectors, but the address from successive calls you make to new aren't contiguous and neither are the calls vector makes internally to create its buffer. So, if you have a huge vector of pointers to tiny vectors, your memory is effectively non-contiguous and you won't get good cache hits. If you have a single-element vector to a huge vector, then the memory is contiguous and the cache will work nicely.

Visually, the fast/contiguous layout is:

*a--[0]
.    |
.   [0][1][2][3][4][5][...]

Your slow alternative is:

.        [0]     [0]
.         \       /
*a--[0][1][2][3][4][5][...]
.    |   \    |     \
.   [0]   \[0][0]   [0]

Multidimensional arrays can be created on the stack using e.g.

int x[10][20];

In which case the memory will be contiguous, with memory in each of x[0], x[1] etc. contiguous. (So x[0][0] is before x[0][1], not immediately before x[1][0].)

To effectively have a contiguous multi-dimensional array on the heap, you should new a vector with the product of the intended dimensions, then write a wrapper class that conveniently multiplies out the dimensions to find a particular element.

like image 194
Tony Delroy Avatar answered Apr 27 '23 00:04

Tony Delroy


The computation time is different because of data caching. Knowing about the locality of reference, CPU loads data from adjacent addresses when you read a location in memory. It predicts that the next read is going to be from the location just a few bytes forward of the address that you just read.

When the array is allocated as [1][N], the elements are indeed stored consecutively, so CPU's predictions come true nearly all the time. The data that you need is nearly always available from the CPU's cache, which is several times faster than the main memory. CPU continues loading locations forward of where you just read while it performs the computation, so loading the new data and adding up the numbers continues in parallel.

When you switch dimensions around, the numbers that you add up are no longer in consecutive locations. This is because consecutive calls to new will not allocate data in consecutive memory regions: memory management libraries adds several bytes for "bookkeeping" purposes, and it always allocates chunks of memory of some minimum size, which is often greater than double. When you request a chunk that is smaller than the minimum, the allocated region is padded. As the result, your doubles could end up being up to twenty bytes apart in the best-case scenario* - enough to nullify the effects of reading ahead from adjacent memory locations. That's why the CPU is forced to wait while it loads data from a different location. This slows down the calculation several times.


* In the worst case, the values could be placed arbitrarily far, depending on the allocations and deallocations performed prior to running your code.
like image 45
Sergey Kalinichenko Avatar answered Apr 27 '23 00:04

Sergey Kalinichenko