Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MATLAB is too slow calculation of Spearman's rank correlation for 9-element vectors

I need to calculate the Spearman's rank correlation (using corr function) for pairs of vectors with different lengths (for example 5-element vectors to 20-element vectors). The number of pairs is usually above 300 pairs for each length. I track the progress with waitbar. I have noticed that it takes unusually very long time for 9-element pair of vectors, where for other lengths (greater and smaller) it takes very short times. Since the formula is exactly the same, the problem must have originated in MATLAB function corr.

I wrote the following code to verify that the problem is with corr function and not other calculations that I have besides 'corr', where all of that calculations (including 'corr') take place inside some 2 or 3 'for' loops. The code repeats the timing 50 times to avoid accidental results.

The result is a bar graph, confirming that it takes a long time for MATLAB to calculate Spearman's rank correlation for 9-element vectors. Since my calculations are not that heavy, this problem does not cause endless wait, it just increases the total time consumed for the whole process. Can someone tell me that what causes the problem and how to avoid it?

Times1 = zeros(20,50);

for i = 5:20
    for j = 1:50
        tic
        A = rand(i,2);
        [r,p] = corr(A(:,1),A(:,2),'type','Spearman');
        Times1(i,j) = toc;
    end
end

Times2 = mean(Times1,2);

bar(Times2);
xticks(1:25);
xlabel('number of elements in vectors');
ylabel('average time');
like image 298
KareeM Avatar asked Oct 17 '22 23:10

KareeM


1 Answers

After some investigation, I think I found the root of this very interesting problem. My tests have been conducted profiling every outer iteration using the built-in Matlab profiler, as follows:

res = cell(20,1);

for i = 5:20
    profile clear;
    profile on -history;

    for j = 1:50
        uni = rand(i,2);
        corr(uni(:,1),uni(:,2),'type','Spearman');
    end

    profile off;
    p = profile('info');

    res{i} = p.FunctionTable;
end

The produced output looks like this:

Output

The first thing I noticed is that the Spearman correlation for matrices with a number of rows less than or equal to 9 is computed in a different way than for matrices with 10 or more rows. For the former, the functions being internally called by the corr function are:

 Function                Number of Calls
----------------------- -----------------
'factorial'             100
'tiedrank>tr'           100
'tiedrank'              100
'corr>pvalSpearman'     50
'corr>rcumsum'          50
'perms>permsr'          50
'perms'                 50
'corr>spearmanExactSub' 50
'corr>corrPearson'      50
'corr>corrSpearman'     50
'corr'                  50
'parseArgs'             50
'parseArgs'             50

For the latter, the functions being internally called by the corr function are:

 Function                Number of Calls
----------------------- -----------------
'tiedrank>tr'           100
'tiedrank'              100
'corr>AS89'             50
'corr>pvalSpearman'     50
'corr>corrPearson'      50
'corr>corrSpearman'     50
'corr'                  50
'parseArgs'             50
'parseArgs'             50

Since the computation of the Spearman correlation for matrices with 10 or more rows seems to run smoothly and quickly and doesn't show any evidence of performance bottlenecks, I decided to avoid losing time investigating on this fact and I focused on the main concern: the small matrices.

I tried to understand the difference between the execution time of the whole process for a matrix with 5 rows and for one with 9 rows (the one notably showing the worst performance). This is the code I used:

res5 = res{5,1};
res5_tt = [res5.TotalTime];
res5_tt_perc = ((res5_tt ./ sum(res5_tt)) .* 100).';

res9_tt = [res{9,1}.TotalTime];
res9_tt_perc = ((res9_tt ./ sum(res9_tt)) .* 100).';

res_diff = res9_tt_perc - res5_tt_perc;
[~,res_diff_sort] = sort(res_diff,'desc');

tab = [cellstr(char(res5.FunctionName)) num2cell([res5_tt_perc res9_tt_perc res_diff])];
tab = tab(res_diff_sort,:);
tab = cell2table(tab,'VariableNames',{'Function' 'TT_M5' 'TT_M9' 'DIFF'});

And here is the result:

       Function                  TT_M5                TT_M9                  DIFF       
_______________________    _________________    __________________    __________________

'corr>spearmanExactSub'     7.14799963478685      16.2879721171023       9.1399724823154
'corr>pvalSpearman'         7.98185309750143      16.3043118970503      8.32245879954885
'perms>permsr'              3.47311716905926      8.73599255035966      5.26287538130039
'perms'                     4.58132952553723      8.77488502392486      4.19355549838763
'corr>corrSpearman'          15.629476293326       16.440893059217     0.811416765890929
'corr>rcumsum'             0.510550019981949    0.0152486312660671    -0.495301388715882
'factorial'                0.669357868472376    0.0163923929871943    -0.652965475485182
'parseArgs'                 1.54242684137027    0.0309456171268161     -1.51148122424345
'tiedrank>tr'               2.37642998160463     0.041010720272735      -2.3354192613319
'parseArgs'                  2.4288171135289    0.0486075856244615     -2.38020952790444
'corr>corrPearson'          2.49766877262937    0.0484657591710417     -2.44920301345833
'tiedrank'                  3.16762535118088    0.0543584195582888     -3.11326693162259
'corr'                      21.8214856092549      16.5664346332513     -5.25505097600355

Once the bottleneck was detected, I started analyzing the internal code (open corr) and I finally found the cause of the problem. Within the spearmanExactSub, this part of code is being executed (where n is the number of rows of the matrix):

n = arg1;
nfact = factorial(n);
Dperm = sum((repmat(1:n,nfact,1) - perms(1:n)).^2, 2);

A permutation is being computed on a vector whose values range from 1 to n. This is what comes into play increasing the computational complexity (and, obviously, the computational time) of the function. Other operations, like the subsequent repmat on factorial(n) of 1:n and the ones below that point, contribute to worsen the situation. Now, long story short...

factorial(5) = 120
factorial(6) = 720
factorial(7) = 5040
factorial(8) = 40320
factorial(9) = 362880

can you see the reason why, between 5 and 9, your bar graph shows an "exponentially" increasing computational time?

On a side note, there is nothing you can do to solve this problem, unless you find another implementation of the Spearman correlation that doesn't present the same bottleneck or you implement your own.

like image 82
Tommaso Belluzzo Avatar answered Oct 21 '22 07:10

Tommaso Belluzzo