I have a row of data, say A = [0 1 1 1 0 0]
.
Matrix B contains many rows. For a dummy example let's say it's just B = [1 1 1 0 1 0; 1 0 0 1 0 1]
.
I want to find the number of columns in which A and a row of B differ, and use that vector of differences to find which row of B is most similar to A. So for the example above, A differs from B(1,:)
in columns 1, 4, 5 = 3 total difference. A differs from B(2,:) in columns 1, 2, 3, 6 = 4 total differences, and so I would want to return index 1 to indicate that A is most similar to B(1,:).
In reality B has ~50,000 rows, and A and B both have about 800 columns. My current code to find the most similar row is below:
min(sum(xor(repmat(A,B_rows,1),B),2));
That works, but it's very slow. Any insights into which function is taking so long and ways to improve it?
There are 6 more or less similar approaches with bsxfun
, hard to tell which one is the fastest.
%example data
A = [0 1 1 1 0 0];
B = perms(A); %720x6 matrix
Counting similarities:
Z = sum( bsxfun(@eq, A,B) , 2);
Z = sum( ~bsxfun(@ne, A,B) , 2);
Z = sum( ~bsxfun(@xor, A,B) , 2);
Counting differences:
Z = sum( ~bsxfun(@eq, A,B) , 2);
Z = sum( bsxfun(@ne, A,B) , 2);
Z = sum( bsxfun(@xor, A,B) , 2);
Z
is a vector containing the number of equal/unequal elements of A
for every row of B
.
Benchmark for 100 trials each (same order as above):
t100 =
0.0081
0.0098
0.0123
0.0102
0.0087
0.0111
Use pdist2
with 'hamming'
distance
[D I] = pdist2( A, B, 'hamming', 'Smallest', 1 );
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With