
Reason for huge performance penalty for >4D arrays in MATLAB?

Introduction

I have an algorithm that loops billions (or even trillions) of times and manipulates a matrix stored in 7 dimensions [10x10x10x10x10x10x10]. I found that accessing elements of this 7-dimensional matrix was quite slow, so, curious as I am, I ran some tests to measure the performance of accessing elements of multi-dimensional matrices.

Hypothesis

I was reminded that MATLAB uses linear indexing under the hood, and a friend of mine suggested that the performance penalty was probably due to the conversion from 'normal' (subscript) indexing to linear indexing under the hood (Source).
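
As a quick illustration of what that conversion refers to (this is just MATLAB's documented column-major, 1-based storage model, not a claim about its internals):

A = rand(10,10);
isequal(A(3,2), A(13))   % true: subscript (3,2) sits at linear position 13
sub2ind(size(A), 3, 2)   % ans = 13, i.e. the subscript-to-linear conversion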

Testing method

To test this hypothesis I timed element access using linear indexing and normal (subscript) indexing for 2D to 7D matrices. I varied both the element being accessed and the matrix size (i.e. the length of each dimension), but that did not change the results significantly. The script I used for testing is included below. The hardware used is an Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70 GHz with 16 GB RAM; the MATLAB version is R2013b.

Results

Normal indexing 2D: 0.073659s
Linear indexing 2D: 0.064026s
Normal indexing 3D: 0.050719s
Linear indexing 3D: 0.064096s
Normal indexing 4D: 0.055674s
Linear indexing 4D: 0.062112s
Normal indexing 5D: 15.689907s
Linear indexing 5D: 5.265076s
Normal indexing 6D: 16.660496s
Linear indexing 6D: 5.295958s
Normal indexing 7D: 17.029072s
Linear indexing 7D: 5.291139s

It appears that normal indexing performs very similarly to linear indexing up to 4D matrices; for 3D and 4D matrices normal indexing even appears slightly preferable. Above 4D, linear indexing is clearly preferable to normal indexing, but both approaches take a huge performance hit (roughly two orders of magnitude).

Conclusion

The moral of the story: carefully consider whether you really need more than four dimensions in a matrix when aiming for high performance in MATLAB (apart from the obvious fact that, e.g., C++ is much faster than MATLAB for most applications, but that is another discussion).
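
One workaround sketch (illustrative only, and not something I measured): do the subscript-to-linear conversion once outside the hot loop and index a flattened view of the data; whether this actually recovers the low-dimensional timings would need the same kind of measurement as below.

% Illustrative workaround: convert the subscripts once, index a 1-D view in the loop
F  = rand(10,10,10,10,10,10,10);
Fv = F(:);                               % same data, viewed as a vector
k  = sub2ind(size(F), 3,2,1,10,1,2,3);   % one-off conversion
x  = Fv(k);                              % equivalent to F(3,2,1,10,1,2,3)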

Question(s)

As in the title: what is the (potential) cause for this huge hit in performance when exceeding four dimensions in a matrix? And do other languages (such as C++) demonstrate this behaviour as well?

Code used for testing

clear all
clc

% Number of iterations
n=10000000;

% 2D: subscript access A(3,2) vs. precomputed linear index k1
A = rand(10,10);
k1=sub2ind(size(A),3,2);
fprintf('\n')
tic
for ii=1:n
    A1=A(3,2);
end
a=toc;
fprintf('Normal indexing 2D: %fs\n',a)


tic
for ii=1:n
    A2=A(k1);
end
a=toc;
fprintf('Linear indexing 2D: %fs\n',a)

% 3D
B = rand(10,10,10);
k2=sub2ind(size(B),3,2,1);

tic
for ii=1:n
    B1=B(3,2,1);
end
a=toc;
fprintf('Normal indexing 3D: %fs\n',a)
tic
for ii=1:n
    B2=B(k2);
end
a=toc;
fprintf('Linear indexing 3D: %fs\n',a)

% 4D
C = rand(10,10,10,10);
k3=sub2ind(size(C),3,2,1,10);

tic
for ii=1:n
    C1=C(3,2,1,10);
end
a=toc;
fprintf('Normal indexing 4D: %fs\n',a)
tic
for ii=1:n
    C2=C(k3);
end
a=toc;
fprintf('Linear indexing 4D: %fs\n',a)

% 5D
D = rand(10,10,10,10,10);
k4=sub2ind(size(D),3,2,1,10,1);

tic
for ii=1:n
    D1=D(3,2,1,10,1);
end
a=toc;
fprintf('Normal indexing 5D: %fs\n',a)
tic
for ii=1:n
    D2=D(k4);
end
a=toc;
fprintf('Linear indexing 5D: %fs\n',a)

% 6D
E = rand(10,10,10,10,10,10);
k5=sub2ind(size(E),3,2,1,10,1,2);

tic
for ii=1:n
    E1=E(3,2,1,10,1,2);
end
a=toc;
fprintf('Normal indexing 6D: %fs\n',a)
tic
for ii=1:n
    E2=E(k5);
end
a=toc;
fprintf('Linear indexing 6D: %fs\n',a)

% 7D
F = rand(10,10,10,10,10,10,10);
k6=sub2ind(size(F),3,2,1,10,1,2,3);

tic
for ii=1:n
    F1=F(3,2,1,10,1,2,3);
end
a=toc;
fprintf('Normal indexing 7D: %fs\n',a)
tic
for ii=1:n
    F2=F(k6);
end
a=toc;
fprintf('Linear indexing 7D: %fs\n',a)
asked Jul 25 '14 by EJG89


1 Answer

At some point there must be a linear address for each value, because that is how most processors work: they are built to fetch data from a single location.

Throughout this answer, the terms matrix and array will be used interchangeably.

Single Dimension Matrix
The address calculation for a single dimension array:

  address = starting_array_address + (index * sizeof(object_type));

The object_type is the type of the object, such as unsigned char or int.

2 Dimensional Matrix
The calculation for the index position of a value in a 2D array represented as a 1D array is:

  index = Dimension2_value * Dimension1_capacity + Dimension1_value;

The address of the value can be calculated using the index and the equation for the Single Dimensional object.
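
A quick MATLAB check of the same arithmetic (MATLAB subscripts are 1-based, so every term except the first gets a -1), using the 10x10 case from the question:

A   = rand(10,10);
d1  = size(A,1);                       % Dimension1_capacity
idx = 3 + (2 - 1) * d1;                % hand-rolled index for subscript (3,2), i.e. 13
isequal(idx, sub2ind(size(A), 3, 2))   % true
isequal(A(idx), A(3,2))                % true: both reach the same element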

3 Dimensional Matrix
The Dimension3_value can be thought of as selecting which plane of the cube the value sits in, so each step along it skips an entire plane of Dimension1_capacity * Dimension2_capacity elements.
So, this extends the equation by another term:

index = Dimension3_value * (Dimension2_capacity * Dimension1_capacity)
      + Dimension2_value * Dimension1_capacity
      + Dimension1_value;
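
The same check for the 3D formula, with the 10x10x10 case and subscripts (3,2,1) from the test script:

B   = rand(10,10,10);
[d1, d2, ~] = size(B);
idx = 3 + (2 - 1) * d1 + (1 - 1) * d1 * d2;   % = 13, since the third subscript is 1
isequal(idx, sub2ind(size(B), 3, 2, 1))       % true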

4D & above Matrices
The calculation for 4D and above matrices can be extended in a similar manner: each additional dimension contributes one more term, multiplied by the product of the capacities of all lower dimensions.

The bottleneck in the index calculation is the multiplications: one per additional dimension.

Some platforms can calculate the terms in parallel, thus giving some speed advantages.
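
To make "extended in a similar manner" concrete, here is a minimal MATLAB sketch of the general column-major, 1-based conversion (my_sub2ind is a made-up helper name, and this is not MATLAB's actual implementation of the built-in sub2ind): one extra multiplication and addition per additional dimension.

function idx = my_sub2ind(sz, varargin)
% Hypothetical helper: subscripts -> linear index for a column-major array.
    subs   = [varargin{:}];            % one subscript per dimension
    idx    = subs(1);
    stride = 1;
    for d = 2:numel(subs)
        stride = stride * sz(d-1);     % elements skipped per step along dimension d
        idx    = idx + (subs(d) - 1) * stride;
    end
end

% e.g. my_sub2ind(size(F), 3,2,1,10,1,2,3) == sub2ind(size(F), 3,2,1,10,1,2,3)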

Speed speculations
I guess that Matlab has optimized its index calculations for 2D and 3D matrices but not larger ones. The 2D and 3D matrices are more widely used than 4D and larger.

A typical optimization would be to compute each term in a separate register and then sum the registers, keeping the values as close to the processor as possible. Some processors may not have enough registers for the 4D-and-above calculation, so the compiler may have to spill values to temporary memory (outside the processor), slowing down performance.

Fetching the value
For 1D, 2D and 3D matrices, the processor may be able to keep the data in its data cache, speeding up access. As the number of dimensions grows, the matrix must shrink for it to still fit in the processor's data cache. Any access to a value that is not in the cache is a cache miss, and the processor must then load the value from memory external to the processor. Depending on how the data is reloaded, it may fetch a "strip" of data (a cache line) or much more. Either way, performance drops.

Summary
Calculating the index takes more processing time for each additional dimension of a matrix, and the multiplications are the bottleneck of that calculation (even if each is a single instruction, a multiplication takes longer than an addition). Although the terms of the index calculation could in principle be computed in parallel, there may not be enough processing units to handle all the dimensions at once. There may also not be enough registers or data cache for the optimal calculation, forcing the processor to fetch from external memory. And if the data item itself is not in the processor's data cache, the processor wastes time reloading all or part of the cache, time it could otherwise spend on calculations.

In general, operating on smaller sub-matrices (slices of the full array) usually results in better performance than operating on a large multi-dimensional matrix.
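
A rough illustration of that suggestion (hypothetical usage, assuming the trailing subscripts stay fixed across the inner loop): extract the low-dimensional slice once, then index only the slice.

F     = rand(10,10,10,10,10,10,10);
plane = F(:,:,1,10,1,2,3);   % 10x10 slice; the last five subscripts are fixed
x     = plane(3,2);          % same value as F(3,2,1,10,1,2,3)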

answered by Thomas Matthews