Given three char arrays with size(a) = [N,80], size(b) = [N,100], and size(c) = [N,10]: when N = 5, each of a, b, and c looks something like
ans =
5×80 char array
'efawefref'
'Afreafraef'
'afeafefaef'
'afeafeaffa'
'afeafaefae'
I want to find the unique entries (not combinations), that is, the unique entries of x = [a, b, c].
Of course I can do unique([a, b, c]), but this is terribly slow for this data (N ~ 1e7).
For example:
a = [ 'timon ';
'simba ';
'nala ';
'timon ';
'mufasa'];
b = [ 'boar ';
'lion ';
'lionese';
'boar ';
'lion '];
c = [ 'chubby';
'small ';
'fat ';
'chubby';
'fit '];
unique([a,b,c],'rows')
ans =
4×19 char array
'mufasalion fit '
'nala lionesefat '
'simba lion small '
'timon boar chubby'
size(unique([a,b,c],'rows'),1)
ans =
4
Is there a smarter way to do this?
EDIT: results from the answers. For arrays of these sizes,
>> size(a)
ans =
11724952 76
>> size(b)
ans =
11724952 64
>> size(c)
ans =
11724952 6
Results
@myradio
>> tic, size(unique(horzcat(a,b,c),'rows')), toc
ans =
1038303 146
Elapsed time is 74.402044 seconds.
@gnovice 1
>> tic, size(unique(cellstr([a b c]))), toc
ans =
1038303 1
Elapsed time is 77.044463 seconds.
@gnovice 2
>> tic, map = containers.Map(cellstr([a b c]), ones(length(a), 1)); size(map.keys.'), toc
ans =
1038303 1
Elapsed time is 58.732947 seconds.
@Wolfie
>> tic, size(unique( [categorical(cellstr(a)),categorical(cellstr(b)),categorical(cellstr(c))], 'rows' )), toc
ans =
1038303 3
Elapsed time is 189.517131 seconds.
@obchardon
>> tic, x = primes(2000); a1 = prod(x(a+0),2); b1 = prod(x(b+0),2); c1 = prod(x(c+0),2); size(unique([a1,b1,c1],'rows')), toc
ans =
1038258 3
Elapsed time is 46.889431 seconds.
I am puzzled by this last one; I tried it with other examples and it always returns a slightly lower count.
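My guess at the discrepancy (an assumption, not verified against the original data): with 76 prime factors per row, the products vastly exceed flintmax (2^53, the largest integer a double can represent exactly), so distinct rows can round to the same floating-point value and get merged. A quick magnitude check:

```matlab
% The per-row products overflow exact double-precision integers:
x = primes(2000);
p = prod(x(double('A') * ones(1, 76)));  % product for a hypothetical all-'A' 76-char row
% p is on the order of 1e190, while flintmax = 2^53 is about 9.0e15,
% so nearby products are rounded and distinct rows can collide.
```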
To mimic the larger set of data in the question, I created the following randomized character arrays using randi:
a = char(randi([65 90], [100 76]));      % 100 random 76-character rows
a = a(randi([1 100], [11724952 1]), :);  % Replicate rows: 11724952-by-76 result
b = char(randi([65 90], [100 64]));      % 100 random 64-character rows
b = b(randi([1 100], [11724952 1]), :);  % Replicate rows: 11724952-by-64 result
c = char(randi([65 90], [100 6]));       % 100 random 6-character rows
c = c(randi([1 100], [11724952 1]), :);  % Replicate rows: 11724952-by-6 result
With up to 100 unique strings in each of a, b, and c, this will yield close to 1,000,000 (i.e. 100^3) unique combinations when concatenated.
I then tested three solutions: the original using unique with the 'rows' argument, a variant that converts the character array to a cell array of strings with cellstr so that 'rows' isn't needed, and one using a containers.Map object. The last one feeds the strings as keys to the containers.Map class (with dummy associated values), letting it build a map whose keys are exactly the unique strings, which you can then extract.
Since these tests took a minimum of 1 minute to run, it wasn't feasible to use the more accurate timing routine timeit (which runs the function many times over to get an average measurement), so I used tic/toc instead. Here are some typical results using version R2018a:
>> clear d
>> tic; d = unique(horzcat(a, b, c), 'rows'); toc
Elapsed time is 726.324408 seconds.
>> clear d
>> tic; d = unique(cellstr([a b c])); toc
Elapsed time is 99.312927 seconds.
>> clear d
>> tic; map = containers.Map(cellstr([a b c]), ones(size(a, 1), 1)); d = map.keys.'; toc
Elapsed time is 89.853430 seconds.
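For reference, the containers.Map approach above, stripped of the timing code, is just the following (a sketch; the variable names strs and u are mine):

```matlab
strs = cellstr([a b c]);                           % one string per row of [a b c]
map  = containers.Map(strs, ones(numel(strs), 1)); % duplicate keys collapse to one entry
u    = map.keys.';                                 % unique strings, as a cell column
```

Note that map.keys returns the keys in sorted order, so u is comparable to the output of unique.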
The two faster solutions typically averaged about the same, with containers.Map being slightly faster on average. Both are much faster than unique with the 'rows' argument, although this disagrees with the timings in the post, which used R2018b. Maybe unique received significant updates in the newer version, or maybe the specific content of the character arrays matters greatly (e.g. whether all strings repeat with roughly equal frequency, or whether the arrays are sorted).
Categorical arrays are often quicker for this sort of thing, since internally they are stored as integer codes into a list of categories.
% Set up your example
a = [ 'timon '; 'simba '; 'nala '; 'timon '; 'mufasa'];
b = [ 'boar '; 'lion '; 'lionese'; 'boar '; 'lion '];
c = [ 'chubby'; 'small '; 'fat '; 'chubby'; 'fit '];
% Convert each array to categorical and join into one big categorical array
k = [categorical(cellstr(a)),categorical(cellstr(b)),categorical(cellstr(c))];
% Get unique rows
u = unique( k, 'rows' );
We can make the categorical(cellstr(...)) calls look a bit cleaner when operating on lots of variables by using an anonymous function:
cc = @(x) categorical(cellstr(x));
u = unique( [cc(a), cc(b), cc(c)], 'rows' );
Edit: I'm not sure this actually shows a speed-up; the categorical call is really slow for large char arrays, so my test was rubbish.