Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Serious performance issue with iterating simulations

I recently stumbled upon a performance problem while implementing a simulation algorithm. I managed to find the bottleneck function (signally, it's the internal call to arrayfun that slows everything down):

function sim = simulate_frequency(the_f,k,n)

    r = rand(1,n); % 
    x = arrayfun(@(x) find(x <= the_f,1,'first'),r);
    sim = (histcounts(x,[1:k Inf]) ./ n).';

end

It is being used in other parts of code as follows:

h0 = zeros(1,sims);

for i = 1:sims
    p = simulate_frequency(the_f,k,n);
    h0(i) = max(abs(p - the_p));
end

Here are some possible values:

% Test Case 1
sims = 10000;
the_f = [0.3010; 0.4771; 0.6021; 0.6990; 0.7782; 0.8451; 0.9031; 0.9542; 1.0000];
k = 9;
n = 95;

% Test Case 2
sims = 10000;
the_f = [0.0413; 0.0791; 0.1139; 0.1461; 0.1760; 0.2041; 0.2304; 0.2552; 0.2787; 0.3010; 0.3222; 0.3424; 0.3617; 0.3802; 0.3979; 0.4149; 0.4313; 0.4471; 0.4623; 0.4771; 0.4913; 0.5051; 0.5185; 0.5314; 0.5440; 0.5563; 0.5682; 0.5797; 0.5910; 0.6020; 0.6127; 0.6232; 0.6334; 0.6434; 0.6532; 0.6627; 0.6720; 0.6812; 0.6901; 0.6989; 0.7075; 0.7160; 0.7242; 0.7323; 0.7403; 0.7481; 0.7558; 0.7634; 0.7708; 0.7781; 0.7853; 0.7923; 0.7993; 0.8061; 0.8129; 0.8195; 0.8260; 0.8325; 0.8388; 0.8450; 0.8512; 0.8573; 0.8633; 0.8692; 0.8750; 0.8808; 0.8864; 0.8920; 0.8976; 0.9030; 0.9084; 0.9138; 0.9190; 0.9242; 0.9294; 0.9344; 0.9395; 0.9444; 0.9493; 0.9542; 0.9590; 0.9637; 0.9684; 0.9731; 0.9777; 0.9822; 0.9867; 0.9912; 0.9956; 1.000];
k = 90;
n = 95;

The scalar sims must be in the range 1000 1000000. The vector of cumulated frequencies the_f never contains more than 100 elements. The scalar k represents the number of elements in the_f. Finally, the scalar n represents the number of elements in the empirical sample vector, and can even be very large (up to 10000 elements, as far as I can tell).

Any clue about how to improve the computation time of this process?

like image 769
Tommaso Belluzzo Avatar asked Jul 12 '18 20:07

Tommaso Belluzzo


3 Answers

This seems to be slightly faster for me in the second test case, not the first. The time differences might be larger for longer the_f and larger values of n.

function sim = simulate_frequency(the_f,k,n)
    r = rand(1,n); % 
    [row,col] = find(r <= the_f); % Implicit singleton expansion going on here!
    [~,ind] = unique(col,'first');
    x = row(ind);
    sim = (histcounts(x,[1:k Inf]) ./ n).';
end

I'm using implicit singleton expansion in r <= the_f, use bsxfun if you have an older version of MATLAB (but you know the drill).

Find then returns row and column to all the locations where r is larger than the_f. unique finds the indices into the result for the first element of each column.

Credit: Andrei Bobrov over on MATLAB Answers


Another option (derived from this other answer) is a bit shorter but also a bit more obscure IMO:

mask = r <= the_f;
[x,~] = find(mask & (cumsum(mask,1)==1));
like image 57
Cris Luengo Avatar answered Nov 15 '22 10:11

Cris Luengo


If I want performance, I would avoid arrayfun. Even this for loop is faster:

function sim = simulate_frequency(the_f,k,n)

    r = rand(1,n); % 
    for i = 1:numel(r)
        x(i) = find(r(i)<the_f,1,'first');
    end
    sim = (histcounts(x,[1:k Inf]) ./ n).';

end

Running 10000 sims with the first set of the sample data gives the following timing.

Your arrayfun function:

>Elapsed time is 2.848206 seconds.

The for loop function:

>Elapsed time is 0.938479 seconds.

Inspired by Cris Luengo's answer, I suggest below:

function sim = simulate_frequency(the_f,k,n)

    r = rand(1,n); % 
    x = sum(r > the_f)+1;
    sim = (histcounts(x,[1:k Inf]) ./ n)';

end

Time:

>Elapsed time is 0.264146 seconds.
like image 32
Anthony Avatar answered Nov 15 '22 12:11

Anthony


You can use histcounts with r as its input:

r = rand(1,n);
sim = (histcounts(r,[-inf ;the_f]) ./ n).';

If histc is used instead of histcounts the whole simulation can be vectorized:

r = rand(n,sims);
p = histc(r, [-inf; the_f],1);
p = [p(1:end-2,:) ;sum(p(end-1:end,:))]./n;
h0 = max(abs(p-the_p(:)));    %h0 = max(abs(bsxfun(@minus,p,the_p(:))));
like image 24
rahnema1 Avatar answered Nov 15 '22 12:11

rahnema1