Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently find Unique triplets of three char vectors in MATLAB

Given three arrays of char, say size(a) = [N,80], size(b) = [N,100]; size(c) = [N,10];

When N=5 a, b and c look something like,

ans =

  5×80 char array

‘efawefref’
‘Afreafraef’
‘afeafefaef’
‘afeafeaffa’
‘afeafaefae’

I want to find the unique entries (not combinations), this is, the unique entries of x = [a, b, c]

Of course I can do unique([a, b, c]) but this is terrible slow for this data. N~1e7

Example,

   a = [   'timon ';
        'simba ';
        'nala  ';
        'timon ';
        'mufasa'];

b = [   'boar   ';
        'lion   ';
        'lionese';
        'boar   ';
        'lion   '];

c = [   'chubby';
        'small ';
        'fat   ';
        'chubby';
        'fit   '];


unique([a,b,c],'rows')

ans =

  4×19 char array

    'mufasalion   fit   '
    'nala  lionesefat   '
    'simba lion   small '
    'timon boar   chubby'


size(unique([a,b,c],'rows'),1)

ans =

     4

IS there a smarter way to do this?

EDIT: results from answers For entries of these sizes,

>> size(a)

ans =

    11724952          76

>> size(b)

ans =

    11724952          64

>> size(c)

ans =

    11724952           6

Results

@myradio

>> tic, size(unique(horzcat(a,b,c),'rows')), toc

ans =

     1038303         146

Elapsed time is 74.402044 seconds.

@gnovice 1

>> tic, size(unique(cellstr([a b c]))), toc

ans =

     1038303           1

Elapsed time is 77.044463 seconds.

@gnovice 2

>> tic, map = containers.Map(cellstr([a b c]), ones(length(a), 1)); size(map.keys.'), toc

ans =

     1038303           1

Elapsed time is 58.732947 seconds.

@Wolfie

>> tic, size(unique( [categorical(cellstr(a)),categorical(cellstr(b)),categorical(cellstr(c))], 'rows' )), toc

ans =

     1038303           3

Elapsed time is 189.517131 seconds.

@obchardon

>> tic, x = primes(2000); a1 = prod(x(a+0),2); b1 = prod(x(b+0),2); c1 = prod(x(c+0),2); size(unique([a1,b1,c1],'rows')), toc

ans =

     1038258           3

Elapsed time is 46.889431 seconds.

I am puzzled about this last one, I tried with other examples and it always gives a slightly lower value.

like image 764
myradio Avatar asked Aug 06 '19 13:08

myradio


2 Answers

To mimic the larger set of data in the question, I created the following randomized character arrays using randi:

a = char(randi([65 90], [100 76]));      % Generate 100 76-character arrays
a = a(randi([1 100], [11724952 1]), :);  % Replicate rows: 11724952-by-76 result
b = char(randi([65 90], [100 64]));      % Generate 100 64-character arrays
b = b(randi([1 100], [11724952 1]), :);  % Replicate rows: 11724952-by-64 result
c = char(randi([65 90], [100 6]));       % Generate 100 6-character arrays
c = c(randi([1 100], [11724952 1]), :);  % Replicate rows: 11724952-by-6 result

With up to 100 unique strings in each of a, b, and c, this will yield close to 1,000,000 unique combinations when concatenated.

I then tested 3 solutions: the original using unique, a variant that converts the character array to a cell array of strings using cellstr to avoid using the 'rows' argument, and one using a containers.Map object. The last one feeds the strings as keys to the containers.Map class (with dummy associated values) and lets it create a map that will have only the unique strings as its keys, which you can then extract.

Since these tests took a minimum of 1 minute to run, it wasn't feasible to use the more accurate timing routine timeit (which runs the function many times over to get an average measurement). I therefore used tic/toc. Here are some typical results using version R2018a:

>> clear d
>> tic; d = unique(horzcat(a, b, c), 'rows'); toc
Elapsed time is 726.324408 seconds.

>> clear d
>> tic; d = unique(cellstr([a b c])); toc
Elapsed time is 99.312927 seconds.

>> clear d
>> tic; map = containers.Map(cellstr([a b c]), ones(size(a, 1), 1)); d = map.keys.'; toc
Elapsed time is 89.853430 seconds.

The two faster solutions typically averaged around the same, with the containers.Map being slightly faster on average. They are both much faster than using unique with the 'rows' argument, although this is in disagreement with the results in the post using version R2018b. Maybe unique had significant updates in the newer version, or maybe the specific content of the character arrays matters greatly (e.g. whether all strings repeat with roughly equal frequency, if the arrays are sorted versus unsorted, etc.).

like image 102
gnovice Avatar answered Dec 01 '22 17:12

gnovice


Categorical arrays are often quicker for this sort of thing, as they are roughly treated as ordinals internally.

% Set up your example
a = [ 'timon '; 'simba '; 'nala  '; 'timon '; 'mufasa'];
b = [ 'boar   '; 'lion   '; 'lionese'; 'boar   '; 'lion   '];
c = [ 'chubby'; 'small '; 'fat   '; 'chubby'; 'fit   '];

% Make the arrays larger and join into one big categorical array
k = [categorical(cellstr(a)),categorical(cellstr(b)),categorical(cellstr(c))];

% Get unique rows
u = unique( k, 'rows' );

We can make the categorical(cellstr(...)) look a bit cleaner if operating on lots of variables by using an anonymous function:

cc = @(x) categorical(cellstr(x));
u = unique( [cc(a), cc(b), cc(c)], 'rows' );

Edit: Not sure this actually shows a speed-up, the categorical call is really slow for large char arrays,my test was rubbish.

like image 25
Wolfie Avatar answered Dec 01 '22 16:12

Wolfie