I have a large alphabetically ordered cell array of strings (~495 thousand), with lots of duplicates (which are next to each other, because it's alphabetical).
For a given look-up string, I need to find all the strings in the list which will match the one I pass in.
I've been using strcmp(lookUpString,list)
to do this, but this is extremely slow-- I think it's going through each value in the list to compare, because it doesn't know it's alphabetically sorted.
I could write a while loop to iterate through the list to compare each string using strcmp
until I find the block of strings I want (and then stop), but I was wondering if there was a "matlab" way of doing this (i.e. performing logical comparison operations on a sorted array).
Thanks for your help!
UPDATE: I was not satisfied with my earlier "Method 3" so I've just re-jigged it a little to get better performance. It now runs almost 10 times faster than a naive strcmp
.
strcmp
wins on my machine (2011b on Linux Mint 12). In particular, it works much better than ismember
. However, you can gain a bit of an extra speed up if you do some manual presorting yourself. Consider the following speed test:
NumIter = 100;
N = 495000;
K = N / 20;
List = cell(N, 1);
for i = 1:20
List(i*K - K + 1:i*K) = cellstr(char(i+96));
end
StrToFind = cell(NumIter, 1);
for j = 1:NumIter
StrToFind{j} = char(round(rand * 20) + 96);
end
%# METHOD 1 (ismember)
tic
for j = 1:NumIter
Index1 = ismember(List, StrToFind{j});
Soln1 = List(Index1);
end
toc
%#METHOD 2 (strcmp)
tic
for j = 1:NumIter
Index2 = strcmp(StrToFind{j}, List);
Soln2 = List(Index2);
end
toc
%#METHOD 3 (strmp WITH MANUAL PRE-SORTING)
tic
for j = 1:NumIter
CurStrToFind = StrToFind{j};
K = 100;
I1 = zeros(K, 2); I1(1, :) = ones(1, 2);
I2 = zeros(K, 2); I2(end, 1) = 1; I2(end, 2) = N;
KDiv = floor(N/K);
for k = 2:K-1
CurSearchNum = k * KDiv;
CurListItem = List{CurSearchNum};
if CurListItem < CurStrToFind; I1(k, 1) = 1; end;
if CurListItem > CurStrToFind; I2(k, 1) = 1; end;
I1(k, 2) = CurSearchNum; I2(k, 2) = CurSearchNum;
end
a = find(I1(:, 1), 1, 'last');
b = find(I2(:, 1), 1, 'first');
ShortList = List(I1(a, 2):I2(b, 2));
Index3 = strcmp(CurStrToFind, ShortList);
Soln3 = ShortList(Index3);
end
toc
The output is:
Elapsed time is 6.411537 seconds.
Elapsed time is 1.396239 seconds.
Elapsed time is 0.150143 seconds.
ismember is your friend. Instead of linear search, it does binary search.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With