Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Length of longest repeating string within a longer string

I'm working with a series of long strings (e.g., 'ABAABBFGGBHHSFAFDAFDAFDBB'), that have different lengths. For each string, I'd like to find the length of the longest consecutive occurrences of a specific substring (for instance 'AFD', for which the answer in the above example is 3). Any elegant way of achieving this with MATLAB ?

like image 573
Peter Van Nostrand Avatar asked Jun 03 '15 10:06

Peter Van Nostrand


2 Answers

Let the input and search strings be -

in_str = 'ABAAAFDAFDBBFGGBHHSFAFDAFDAFDBB'
search_str = 'AFD'

We can use strfind to get starting indices for search string in input string and from those detect consecutive groups of search string -

idx = strfind(in_str,search_str)
grp_matches = diff(idx)==numel(search_str)

So, now we have "islands" of zeros and ones, where the islands of ones represent the presence of consecutive grouped search stings. Next up, we need to find the islands lengths and the maximum island length would be the desired output -

df1 = diff([0 grp_matches 0]) %// Perform differentiation of matches

The ending of islands would be indicated by "-1" in differentiation result and "1" would denote the starting of those islands. So, (find(df1==-1) - find(df1==1))+1 would be the island lengths. The final output would be the max of it -

out = max(find(df1==-1) - find(df1==1))+1 

Summing up the discussion, the entire code could be made into a compact version like so -

df1 = diff([0 diff(strfind(in_str,search_str))==numel(search_str) 0])
out = max(find(df1==-1) - find(df1==1))+1 
like image 81
Divakar Avatar answered Nov 01 '22 19:11

Divakar


Use a regular expression:

str = 'AFDABAABBFGGBHHSFAFDAFDAFDBB'; %// note: two occurrences of repeated 'AFD'
substr = 'AFD';                       %// sought substring 
r = regexp(str, ['(' substr ')+'], 'match');
lengths = cellfun(@numel, r)/numel(substr);
result = max(lengths);

You can increase speed using 'length' instead of @numel, as suggested by Divakar:

lengths = cellfun('length', r)/numel(substr);

In this example,

lengths =
     1     3
result =
     3
like image 35
Luis Mendo Avatar answered Nov 01 '22 17:11

Luis Mendo