Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to find unique values in an array

Tags:

arrays

matlab

I'm trying to find a fastest way for finding unique values in a array and to remove 0 as a possibility of unique value.

Right now I have two solutions:

result1 = setxor(0, dataArray(1:end,1)); % This gives the correct solution
result2 = unique(dataArray(1:end,1)); % This solution is faster but doesn't give the same result as result1

dataArray is equivalent to :

dataArray = [0 0; 0 2; 0 4; 0 6; 1 0; 1 2; 1 4; 1 6; 2 0; 2 2; 2 4; 2 6]; % This is a small array, but in my case there are usually over 10 000 lines.

So in this case, result1 is equal to [1; 2] and result2 is equal to [0; 1; 2]. The unique function is faster but I don't want 0 to be considered. Is there a way to do this with unique and not consider 0 as a unique value? Is there an another alternative?

EDIT

I wanted to time the various solutions.

clc
dataArray = floor(10*rand(10e3,10));
dataArray(mod(dataArray(:,1),3)==0)=0;
% Initial
tic
for ii = 1:10000
   FCT1 = setxor(0, dataArray(:,1));
end
toc
% My solution
tic
for ii = 1:10000
   FCT2 = unique(dataArray(dataArray(:,1)>0,1));
end
toc
% Pursuit solution
tic
for ii = 1:10000
   FCT3 = unique(dataArray(:, 1));
   FCT3(FCT3==0) = [];
end
toc
% Pursuit solution with chappjc comment
tic
for ii = 1:10000
   FCT32 = unique(dataArray(:, 1));
   FCT32 = FCT32(FCT32~=0);
end
toc
% chappjc solution
tic
for ii = 1:10000
   FCT4 = setdiff(unique(dataArray(:,1)),0);
end
toc
% chappjc 2nd solution
tic
for ii = 1:10000
   FCT5 = find(accumarray(dataArray(:,1)+1,1))-1;
   FCT5 = FCT5(FCT5>0);
end
toc

And the results:

Elapsed time is 5.153571 seconds. % FCT1 Initial
Elapsed time is 3.837637 seconds. % FCT2 My solution
Elapsed time is 3.464652 seconds. % FCT3 Pursuit solution
Elapsed time is 3.414338 seconds. % FCT32 Pursuit solution with chappjc comment
Elapsed time is 4.097164 seconds. % FCT4 chappjc solution
Elapsed time is 0.936623 seconds. % FCT5 chappjc 2nd solution

However, the solution with sparse and accumarray only works with integer. These solutions won't work with double.

like image 258
m_power Avatar asked Dec 18 '13 20:12

m_power


2 Answers

Here's a wacky suggestion with accumarray, demonstrated using Floris' test data:

a = floor(10*rand(100000, 1)); a(mod(a,3)==0)=0;
result = find(accumarray(nonzeros(a(:,1))+1,1))-1;

Thanks to Luis Mendo for pointing out that with nonzeros, it is not necessary to perform result = result(result>0)!

Note that this solution requires integer-valued data (not necessarily an integer data type, but just not with decimal components). Comparing floating point values for equality, as unique would do, is perilous. See here and here.


Original suggestion: Combine unique with setdiff:

result = setdiff(unique(a(:,1)),0)

Or remove with logical indexing after unique:

result = unique(a(:,1));
result = result(result>0);

I generally prefer not to assign [] as in (result(result==0)=[];) since it gets very inefficient for large data sets.

Removing zeros after unique should be faster since the it operates on less data (unless every element is unique, OR if a/dataArray is very short).

like image 86
chappjc Avatar answered Oct 10 '22 16:10

chappjc


Just to add to the general clamor - here are three different methods. They all give the same answer, but slightly different timings:

a = floor(10*rand(100000, 1));
a(mod(a,3)==0)=0;
tic
b1 = unique(a(:,1));
b1(b1==0) = [];
toc
tic
b2 = find(sparse(a(:,1)+1, 1, 1)) - 1;
b2(b2==0)=[];
toc
tic
b3 = setxor(0, a(:, 1), 'rows');
toc

display(b1)
display(b2)
display(b3)

On my machine, the timings (for an array of 100000 elements) were as follows:

0.0087 s  - for unique
0.0142 s  - for find(sparse)
0.0302 s  = for setxor

I always like sparse for a problem like this - you get the count of elements at the same time as their unique values.

EDIT per @chappj's suggestion. I added a fourth option

b4 = find(accumarray(a(:,1)+1,1)-1);
b4(b4==0) = [];

Time:

0.0029 s , THREE TIMES FASTER THAN UNIQUE

Ladies and gentlemen, we have a winner.

AFTERWORD the index-based methods (sparse and accumarray) only work with integer-valued inputs (although they can be of double type). This seemed OK based on the input array given in the question, but doesn't work for non-integer valued inputs. Of course, unique is a tricky concept when you have doubles - number that "look" the same may be represented differently. You might consider truncating the input array (sanitizing the data) to make sure this is not a problem. For example, if you did

a = 0.001 * double(int(a * 1000));

You would round all values to no more than 3 significant figures, and because you went "via an int" you are sure that you don't end up with values that are "very subtly different" (say in the 8th digit or beyond). Of course in that case you could also do

a = round(a * 1000);
mina = min(a(:));
b = find(accumarray(a - mina + 1, 1)) + mina - 1;
b = 0.001 * b(b ~= 0);

This is "fairly robust" for non-integer values (in the above case it handles values with up to three significant digits; if you need more, the space requirements will eventually get too large and this method will be slower than unique, which in fact has to sort the data.)

like image 26
Floris Avatar answered Oct 10 '22 15:10

Floris