I have a MATLAB routine with one rather obvious bottleneck. I've profiled the function, with the result that 2/3 of the computing time is used in the function levels
:
The function levels
takes a matrix of floats and splits each column into nLevels
buckets, returning a matrix of the same size as the input, with each entry replaced by the number of the bucket it falls into.
To do this I use the quantile
function to get the bucket limits, and a loop to assign the entries to buckets. Here's my implementation:
function [Y q] = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
if isvector(q)
q=transpose(q);
end
Y = zeros(size(X));
for i = 1:nLevels
% "The variables g and l indicate the entries that are respectively greater than
% or less than the relevant bucket limits. The line Y(g & l) = i is assigning the
% value i to any element that falls in this bucket."
if i ~= nLevels % "The default; doesnt include upper bound"
g = bsxfun(@ge,X,q(i,:));
l = bsxfun(@lt,X,q(i+1,:));
else % "For the final level we include the upper bound"
g = bsxfun(@ge,X,q(i,:));
l = bsxfun(@le,X,q(i+1,:));
end
Y(g & l) = i;
end
Is there anything I can do to speed this up? Can the code be vectorized?
Q = quantile(A,n,1) computes quantiles of the columns in A for the n evenly spaced cumulative probabilities. Because 1 is the specified operating dimension, Q has n rows. Q = quantile(A,n,2) computes quantiles of the rows in A for the n evenly spaced cumulative probabilities.
Create Quantiles of a Data Set in R Programming – quantile() Function. quantile() function in R Language is used to create sample quantiles within a data set with probability[0, 1]. Such as first quantile is at 0.25[25%], second is at 0.50[50%], and third is at 0.75[75%].
If I understand correctly, you want to know how many items fell in each bucket. Use:
n = hist(Y,nbins)
Though I am not sure that it will help in the speedup. It is just cleaner this way.
Edit : Following the comment:
You can use the second output parameter of histc
[n,bin] = histc(...) also returns an index matrix bin. If x is a vector, n(k) = >sum(bin==k). bin is zero for out of range values. If x is an M-by-N matrix, then
How About this
function [Y q] = levels(X,nLevels)
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
Y = zeros(size(X));
for i = 1:numel(q)-1
Y = Y+ X>=q(i);
end
This results in the following:
>>X = [3 1 4 6 7 2];
>>[Y, q] = levels(X,2)
Y =
1 1 2 2 2 1
q =
1 3.5 7
You could also modify the logic line to ensure values are less than the start of the next bin. However, I don't think it is necessary.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With