
Find maximum length of all n-word-length substrings shared by two strings

I am working to produce a Python script that can find the (longest possible) length of all n-word-length substrings shared by two strings, disregarding trailing punctuation. Given two strings:

"this is a sample string"

"this is also a sample string"

I want the script to identify that these strings have a sequence of 2 words in common ("this is") followed by a sequence of 3 words in common ("a sample string"). Here is my current approach:

a = "this is a sample string"
b = "this is also a sample string"

aWords = a.split()
bWords = b.split()

#create counters to keep track of position in string
currentA = 0
currentB = 0

#create counter to keep track of longest sequence of matching words
matchStreak = 0

#create a list that contains all of the matchstreaks found
matchStreakList = []

#create binary switch to control the use of while loop
continueWhileLoop = 1

for word in aWords:
    currentA += 1

    if word == bWords[currentB]:
        matchStreak += 1

        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1

        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)

    elif word != bWords[currentB]:

        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to our list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0

        while word != bWords[currentB]:

            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1

            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break

        if word == bWords[currentB]:
            matchStreak += 1

            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):

                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break

print(matchStreakList)

This script correctly outputs the (maximum) lengths of the common word-length substrings (2, 3), and has done so for all tests so far. My question is: Is there a pair of two strings for which the approach above will not work? More to the point: Are there extant Python libraries or well-known approaches that can be used to find the maximum length of all n-word-length substrings that two strings share?

[This question is distinct from the longest common substring problem, which is only a special case of what I'm looking for (as I want to find all common substrings, not just the longest common substring). This SO post suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable approaches, but I didn't find any working solutions, and my problem is perhaps slightly easier than that mentioned in the link because I'm dealing with words bounded by whitespace.]

EDIT:

I'm starting a bounty on this question. In case it will help others, I wanted to clarify a few quick points. First, the helpful answer suggested below by @DhruvPathak does not find all maximally-long n-word-length substrings shared by two strings. For example, suppose the two strings we are analyzing are:

"They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"

and

"You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

In this case, the list of maximally long n-word-length substrings (disregarding trailing punctuation) is:

all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every

Using the following routine:

#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()

print(matches(a, b))

One gets output:

['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']

In the first place, I am not sure how one could select from this list the substrings that contain only whole words. In the second place, this list does not include "are", one of the desired maximally-long common n-word-length substrings. Is there a method that will find all of the maximally long n-word-long substrings shared by these two strings ("You are all..." and "They all are...")?

asked Nov 28 '13 by duhaime




2 Answers

There are still ambiguities here, and I don't want to spend time arguing about them. But I think I can add something helpful anyway ;-)

I wrote Python's difflib.SequenceMatcher, and spent a lot of time finding expected-case fast ways to find longest common substrings. In theory, that should be done with "suffix trees", or the related "suffix arrays" augmented with "longest common prefix arrays" (the phrases in quotes are search terms if you want to Google for more). Those can solve the problem in worst-case linear time. But, as is sometimes the case, the worst-case linear-time algorithms are excruciatingly complex and delicate, and suffer large constant factors - they can still pay off hugely if a given corpus is going to be searched many times, but that's not the typical case for Python's difflib and it doesn't look like your case either.

Anyway, my contribution here is to rewrite SequenceMatcher's find_longest_match() method to return all the (locally) maximum matches it finds along the way. Notes:

  1. I'm going to use the to_words() function Raymond Hettinger gave you, but without the conversion to lower case. Converting to lower case leads to output that isn't exactly like what you said you wanted.

  2. Nevertheless, as I noted in a comment already, this does output "quill", which isn't in your list of desired outputs. I have no idea why it isn't, since "quill" does appear in both inputs.

Here's the code:

import re
def to_words(text):
    'Break text into a list of words without punctuation'
    return re.findall(r"[a-zA-Z']+", text)

def match(a, b):
    # Make b the longer list.
    if len(a) > len(b):
        a, b = b, a
    # Map each word of b to a list of indices it occupies.
    b2j = {}
    for j, word in enumerate(b):
        b2j.setdefault(word, []).append(j)
    j2len = {}
    nothing = []
    unique = set() # set of all results
    def local_max_at_j(j):
        # maximum match ends with b[j], with length j2len[j]
        length = j2len[j]
        unique.add(" ".join(b[j-length+1: j+1]))
    # during an iteration of the loop, j2len[j] = length of longest
    # match ending with b[j] and the previous word in a
    for word in a:
        # look at all instances of word in b
        j2lenget = j2len.get
        newj2len = {}
        for j in b2j.get(word, nothing):
            newj2len[j] = j2lenget(j-1, 0) + 1
        # which indices have not been extended?  those are
        # (local) maximums
        for j in j2len:
            if j+1 not in newj2len:
                local_max_at_j(j)
        j2len = newj2len
    # and we may also have local maximums ending at the last word
    for j in j2len:
        local_max_at_j(j)
    return unique

Then:

a = "They all are white a sheet of spotless paper " \
    "when they first are born but they are to be " \
    "scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless " \
    "paper, when you first are born; but you are to " \
    "be scrawled and blotted by every goose's quill"

print(match(to_words(a), to_words(b)))

displays:

set(['all',
     'and blotted by every',
     'first are born but',
     'are to be scrawled',
     'are',
     'spotless paper when',
     'white a sheet of',
     'quill'])

EDIT - how it works

A great many sequence matching and alignment algorithms are best understood as working on a 2-dimensional matrix, with rules for computing the matrix entries and later interpreting the entries' meaning.

For input sequences a and b, picture a matrix M with len(a) rows and len(b) columns. In this application, we want M[i, j] to contain the length of the longest common contiguous subsequence ending with a[i] and b[j], and the computational rules are very easy:

  1. M[i, j] = 0 if a[i] != b[j].
  2. M[i, j] = M[i-1, j-1] + 1 if a[i] == b[j] (where we assume an out-of-bounds matrix reference silently returns 0).

Interpretation is also very easy in this case: there is a locally maximum non-empty match ending at a[i] and b[j], of length M[i, j], if and only if M[i, j] is non-zero but M[i+1, j+1] is either 0 or out-of-bounds.

You can use those rules to write very simple & compact code, with two loops, that computes M correctly for this problem. The downside is that the code will take (best, average and worst cases) O(len(a) * len(b)) time and space.

While it may be baffling at first, the code I posted is doing exactly the above. The connection is obscured because the code is heavily optimized, in several ways, for expected cases:

  • Instead of doing one pass to compute M, then another pass to interpret the results, computation and interpretation are interleaved in a single pass over a.

  • Because of that, the whole matrix doesn't need to be stored. Instead only the current row (newj2len) and the previous row (j2len) are simultaneously present.

  • And because the matrix in this problem is usually mostly zeroes, a row here is represented sparsely, via a dict mapping column indices to non-zero values. Zero entries are "free", in that they're never stored explicitly.

  • When processing a row, there's no need to iterate over each column: the precomputed b2j dict tells us exactly the interesting column indices in the current row (those columns that match the current word from a).

  • Finally, and partly by accident, all the preceding optimizations conspire in such a way that there's never a need to know the current row's index, so we don't have to bother computing that either.

EDIT - the dirt simple version

Here's code that implements the 2D matrix directly, with no attempts at optimization (other than that a Counter can often avoid explicitly storing 0 entries). It's extremely simple, short and easy:

def match(a, b):
    from collections import Counter
    M = Counter()
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                M[i, j] = M[i-1, j-1] + 1
    unique = set()
    for i in range(len(a)):
        for j in range(len(b)):
            if M[i, j] and not M[i+1, j+1]:
                length = M[i, j]
                unique.add(" ".join(a[i+1-length: i+1]))
    return unique

Of course ;-) that returns the same results as the optimized match() I posted at first.

EDIT - and another without a dict

Just for fun :-) If you have the matrix model down pat, this code will be easy to follow. A remarkable thing about this particular problem is that a matrix cell's value only depends on the values along the diagonal to the cell's northwest. So it's "good enough" just to traverse all the main diagonals, proceeding southeast from all the cells on the west and north borders. This way only small constant space is needed, regardless of the inputs' lengths:

def match(a, b):
    from itertools import chain
    m, n = len(a), len(b)
    unique = set()
    for i, j in chain(((i, 0) for i in range(m)),
                      ((0, j) for j in range(1, n))):
        k = 0
        while i < m and j < n:
            if a[i] == b[j]:
                k += 1
            elif k:
                unique.add(" ".join(a[i-k: i]))
                k = 0
            i += 1
            j += 1
        if k:
            unique.add(" ".join(a[i-k: i]))
    return unique
answered by Tim Peters


There are really four questions embedded in your post.

1) How do you split text into words?

There are many ways to do this depending on what you count as a word, whether you care about case, whether contractions are allowed, etc. A regular expression lets you implement your choice of word splitting rules. The one I typically use is r"[a-z'\-]+". It catches contractions like don't and allows hyphenated words like mother-in-law.
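For instance, a quick sketch (the sample sentence is invented for illustration):

```python
import re

# The word-splitting pattern described above, applied after lowercasing.
words = re.findall(r"[a-z'\-]+", "Don't bother my mother-in-law!".lower())
print(words)  # ["don't", 'bother', 'my', 'mother-in-law']
```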

2) What data structure can speed-up the search for common subsequences?

Create a location map showing the positions where each word appears. For example, in the sentence you should do what you like, the mapping for you is {"you": [0, 4]} because it appears twice, once at position zero and once at position four.

With a location map in hand, it is a simple matter to loop over the starting points to compare n-length subsequences.
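As a minimal sketch of that mapping (using the example sentence above):

```python
# Build the word -> positions map for the example sentence.
words = "you should do what you like".split()
locations = {}
for i, word in enumerate(words):
    locations.setdefault(word, []).append(i)

print(locations["you"])  # [0, 4]
```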

3) How do I find common n-length subsequences?

Loop over all words in one of the sentences. For each such word, find the places where it occurs in the other sequence (using the location map) and test whether the two n-length slices are equal.

4) How do I find the longest common subsequence?

The max() function finds a maximum value. It takes a key-function such as len() to determine the basis for comparison.
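For example (sample phrases invented for illustration):

```python
# max() with key=len returns the longest string in the list.
phrases = ['are', 'white a sheet of', 'spotless paper when']
print(max(phrases, key=len))  # spotless paper when
```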

Here is some working code that you can customize to your own interpretation of the problem:

import re

def to_words(text):
    'Break text into a list of lowercase words without punctuation'
    return re.findall(r"[a-z']+", text.lower())

def starting_points(wordlist):
    'Map each word to a list of indices where the word appears'
    d = {}
    for i, word in enumerate(wordlist):
        d.setdefault(word, []).append(i)
    return d

def sequences_in_common(wordlist1, wordlist2, n=1):
    'Generate all n-length word groups shared by two word lists'
    starts = starting_points(wordlist2)
    for i, word in enumerate(wordlist1):
        seq1 = wordlist1[i: i+n]
        for j in starts.get(word, []):
            seq2 = wordlist2[j: j+n]
            if seq1 == seq2 and len(seq1) == n:
                yield ' '.join(seq1)

if __name__ == '__main__':

    t1 = "They all are white a sheet of spotless paper when they first are " \
         "born but they are to be scrawled upon and blotted by every goose quill"

    t2 = "You are all white, a sheet of lovely, spotless paper, when you first " \
         "are born; but you are to be scrawled and blotted by every goose's quill"

    w1 = to_words(t1)
    w2 = to_words(t2)

    for n in range(1,10):
        matches = list(sequences_in_common(w1, w2, n))
        if matches:
            print(n, '-->', max(matches, key=len))
answered by Raymond Hettinger