Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Longest common substring from more than two strings

I'm looking for a Python library for finding the longest common sub-string from a set of strings. There are two ways to solve this problem:

  • using suffix trees
  • using dynamic programming.

Method implemented is not important. It is important it can be used for a set of strings (not only two strings).

like image 631
Nicolas NOEL Avatar asked May 23 '10 18:05

Nicolas NOEL


People also ask

How do you find the longest substring in multiple strings?

The longest common substrings of a set of strings can be found by building a generalized suffix tree for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it.

How do you find the common substring?

Approach: Let m and n be the lengths of the first and second strings respectively. Dynamic Programming can be used to find the longest common substring in O(m*n) time. The idea is to find the length of the longest common suffix for all substrings of both strings and store these lengths in a table.

What is a time complexity for finding the longest substring that is common?

Θ(n1+n2)


2 Answers

These paired functions will find the longest common string in any arbitrary array of strings:

def long_substr(data):     substr = ''     if len(data) > 1 and len(data[0]) > 0:         for i in range(len(data[0])):             for j in range(len(data[0])-i+1):                 if j > len(substr) and is_substr(data[0][i:i+j], data):                     substr = data[0][i:i+j]     return substr  def is_substr(find, data):     if len(data) < 1 and len(find) < 1:         return False     for i in range(len(data)):         if find not in data[i]:             return False     return True   print long_substr(['Oh, hello, my friend.',                    'I prefer Jelly Belly beans.',                    'When hell freezes over!']) 

No doubt the algorithm could be improved and I've not had a lot of exposure to Python, so maybe it could be more efficient syntactically as well, but it should do the job.

EDIT: in-lined the second is_substr function as demonstrated by J.F. Sebastian. Usage remains the same. Note: no change to algorithm.

def long_substr(data):     substr = ''     if len(data) > 1 and len(data[0]) > 0:         for i in range(len(data[0])):             for j in range(len(data[0])-i+1):                 if j > len(substr) and all(data[0][i:i+j] in x for x in data):                     substr = data[0][i:i+j]     return substr 

Hope this helps,

Jason.

like image 73
jtjacques Avatar answered Oct 17 '22 09:10

jtjacques


This can be done shorter:

def long_substr(data):   substrs = lambda x: {x[i:i+j] for i in range(len(x)) for j in range(len(x) - i + 1)}   s = substrs(data[0])   for val in data[1:]:     s.intersection_update(substrs(val))   return max(s, key=len) 

set's are (probably) implemented as hash-maps, which makes this a bit inefficient. If you (1) implement a set datatype as a trie and (2) just store the postfixes in the trie and then force each node to be an endpoint (this would be the equivalent of adding all substrings), THEN in theory I would guess this baby is pretty memory efficient, especially since intersections of tries are super-easy.

Nevertheless, this is short and premature optimization is the root of a significant amount of wasted time.

like image 44
Herbert Avatar answered Oct 17 '22 08:10

Herbert