Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find the longest common starting substring in a set of strings [closed]

People also ask

Which is longest common substring in s1 Abcdaf s2 Azbcdf?

The longest common substring is “abcdez” and is of length 6. Try It!

How do you find common substrings?

For every character in string 1 we increment vector index of that character eg: v[s1[i]-'a']++, for every character of string 2 we check vector for the common characters if v[s2[i]-'a'] > 0 then set flag = true and v[s2[i]-'a']– such that one character of string 2 is compared with only one character of string 1.


It's a matter of taste, but this is a simple javascript version: It sorts the array, and then looks just at the first and last items.

//longest common starting substring in an array

function sharedStart(array){
    var A= array.concat().sort(), 
    a1= A[0], a2= A[A.length-1], L= a1.length, i= 0;
    while(i<L && a1.charAt(i)=== a2.charAt(i)) i++;
    return a1.substring(0, i);
}

DEMOS

sharedStart(['interspecies', 'interstelar', 'interstate'])  //=> 'inters'
sharedStart(['throne', 'throne'])                           //=> 'throne'
sharedStart(['throne', 'dungeon'])                          //=> ''
sharedStart(['cheese'])                                     //=> 'cheese'
sharedStart([])                                             //=> ''
sharedStart(['prefix', 'suffix'])                           //=> ''

In Python:

>>> from os.path import commonprefix
>>> commonprefix('interspecies interstelar interstate'.split())
'inters'

Ruby one-liner:

l=strings.inject{|l,s| l=l.chop while l!=s[0...l.length];l}

My Haskell one-liner:

import Data.List

commonPre :: [String] -> String
commonPre = map head . takeWhile (\(x:xs)-> all (==x) xs) . transpose

EDIT: barkmadley gave a good explanation of the code below. I'd also add that haskell uses lazy evaluation, so we can be lazy about our use of transpose; it will only transpose our lists as far as necessary to find the end of the common prefix.


You just need to traverse all strings until they differ, then take the substring up to this point.

Pseudocode:

loop for i upfrom 0
     while all strings[i] are equal
     finally return substring[0..i]

Common Lisp:

(defun longest-common-starting-substring (&rest strings)
  (loop for i from 0 below (apply #'min (mapcar #'length strings))
     while (apply #'char=
                  (mapcar (lambda (string) (aref string i))
                          strings))
     finally (return (subseq (first strings) 0 i))))

Yet another way to do it: use regex greed.

words = %w(interspecies interstelar interstate)
j = '='
str = ['', *words].join(j)
re = "[^#{j}]*"

str =~ /\A
    (?: #{j} ( #{re} ) #{re} )
    (?: #{j}    \1     #{re} )*
\z/x

p $1

And the one-liner, courtesy of mislav (50 characters):

p ARGV.join(' ').match(/^(\w*)\w*(?: \1\w*)*$/)[1]

In Python I wouldn't use anything but the existing commonprefix function I showed in another answer, but I couldn't help to reinvent the wheel :P. This is my iterator-based approach:

>>> a = 'interspecies interstelar interstate'.split()
>>>
>>> from itertools import takewhile, chain, izip as zip, imap as map
>>> ''.join(chain(*takewhile(lambda s: len(s) == 1, map(set, zip(*a)))))
'inters'

Edit: Explanation of how this works.

zip generates tuples of elements taking one of each item of a at a time:

In [6]: list(zip(*a))  # here I use list() to expand the iterator
Out[6]:
[('i', 'i', 'i'),
 ('n', 'n', 'n'),
 ('t', 't', 't'),
 ('e', 'e', 'e'),
 ('r', 'r', 'r'),
 ('s', 's', 's'),
 ('p', 't', 't'),
 ('e', 'e', 'a'),
 ('c', 'l', 't'),
 ('i', 'a', 'e')]

By mapping set over these items, I get a series of unique letters:

In [7]: list(map(set, _))  # _ means the result of the last statement above
Out[7]:
[set(['i']),
 set(['n']),
 set(['t']),
 set(['e']),
 set(['r']),
 set(['s']),
 set(['p', 't']),
 set(['a', 'e']),
 set(['c', 'l', 't']),
 set(['a', 'e', 'i'])]

takewhile(predicate, items) takes elements from this while the predicate is True; in this particular case, when the sets have one element, i.e. all the words have the same letter at that position:

In [8]: list(takewhile(lambda s: len(s) == 1, _))
Out[8]:
[set(['i']),
 set(['n']), 
 set(['t']), 
 set(['e']), 
 set(['r']), 
 set(['s'])]

At this point we have an iterable of sets, each containing one letter of the prefix we were looking for. To construct the string, we chain them into a single iterable, from which we get the letters to join into the final string.

The magic of using iterators is that all items are generated on demand, so when takewhile stops asking for items, the zipping stops at that point and no unnecessary work is done. Each function call in my one-liner has a implicit for and an implicit break.