
Find the smallest unique substring for each string in an array

(I'm writing this in the context of JavaScript, but will accept an algorithmically correct answer in any language)

How do you find the shortest substring of each element in an array of strings where the substring is NOT contained within any of the other elements, ignoring case?

Suppose I have an input array such as:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];

The output should be something like:

var uniqueNames = ["ne", "h", "ua", "ka", "i", "r"];

For my purposes, you can safely assume that no element will be wholly contained within another element.

My Thoughts:
It seems that one could probably brute force this, along the lines of:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];
var uniqueNames = [], nameInd, windowSize, substrInd, substr, otherNameInd, foundMatch;
// For each name
for (nameInd = 0; nameInd < names.length; nameInd++)
{
    var name = names[nameInd];
    // For each possible substring length
    windowLoop:
    for (windowSize = 1; windowSize <= name.length; windowSize++)
    {
        // For each starting index of a substring
        for (substrInd = 0; substrInd <= name.length-windowSize; substrInd++)
        {
            substr = name.substring(substrInd,substrInd+windowSize).toLowerCase();
            foundMatch = false;
            // For each other name
            for (otherNameInd = 0; otherNameInd < names.length; otherNameInd++)
            {
                if (nameInd != otherNameInd && names[otherNameInd].toLowerCase().indexOf(substr) > -1)
                {
                    foundMatch = true;
                    break;
                }
            }

            if (!foundMatch)
            {
                // This substr works!
                uniqueNames[nameInd] = substr;
                break windowLoop;
            }
        }
    }
}

But I have to imagine there's a more elegant solution using tries/prefix trees, suffix arrays, or something interesting like that.

Edit: I believe this is the form the selected answer would take programmatically in JavaScript:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];
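// Object.create(null) gives a map with no inherited keys, so a substring such as "constructor" cannot collide
var uniqueNames = [], permutations = Object.create(null), permutation, nameInd, windowSize, substrInd, substr;

// For each name
for (nameInd = 0; nameInd < names.length; nameInd++)
{
    var name = names[nameInd];
    // For each possible substring length
    for (windowSize = 1; windowSize <= name.length; windowSize++)
    {
        // For each starting index of a substring
        for (substrInd = 0; substrInd <= name.length-windowSize; substrInd++)
        {
            substr = name.substring(substrInd,substrInd+windowSize).toLowerCase();
            // Record the owning name the first time a substring is seen; mark it
            // shared (-1) only if a different name also contains it, so a repeat
            // within one name (e.g. "an" twice in "banana") stays unique to that name
            if (typeof permutations[substr] === "undefined")
            {
                permutations[substr] = nameInd;
            }
            else if (permutations[substr] !== nameInd)
            {
                permutations[substr] = -1;
            }
        }
    }
}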

for (substr in permutations)
{
    permutation = permutations[substr];
    if (permutation !== -1 && ((typeof uniqueNames[permutation] === "string" && substr.length < uniqueNames[permutation].length) || typeof uniqueNames[permutation] === "undefined"))
    {
        uniqueNames[permutation] = substr;
    }
}
Patrick asked Jun 28 '12 13:06




1 Answer

This problem can be solved in O(N*L*L*L) time, where N is the number of strings and L is the maximum string length, using suffix tries. Each trie node also stores a prefix count: the number of times the substring spelled out by the path from the root to that node has appeared across all the suffixes inserted so far.

We construct N+1 tries. The first trie is global: all the suffixes of all N strings are inserted into it. The other N tries are local, one per string, each containing only that string's suffixes.

This preprocessing step of constructing the tries takes O(N*L*L), since each string has up to L suffixes and inserting one suffix takes O(L).

Once the tries have been constructed, for each string we query how many times each of its substrings (starting from the minimum length) occurs in the global trie and in the trie corresponding to that string. If the two counts are the same, the substring does not occur in any string other than that one. This takes O(N*L*L*L): N for each string, L*L for considering each substring, and L for performing each query in the trie.
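The steps above might look roughly like the following in JavaScript. This is only a sketch under a few assumptions: comparison is case-insensitive as in the question, the helper names (`makeNode`, `insertSuffixes`, `countOf`, `shortestUniqueSubstrings`) are made up for illustration, and ties between equally short substrings are broken by taking the leftmost one.

```javascript
// Hypothetical helper: a trie node with child links and an occurrence count.
function makeNode() {
    return { children: Object.create(null), count: 0 };
}

// Insert every suffix of str. After insertion, the count on a node is the
// number of occurrences of the substring spelled by the path from the root.
function insertSuffixes(root, str) {
    for (var start = 0; start < str.length; start++) {
        var node = root;
        for (var i = start; i < str.length; i++) {
            var ch = str.charAt(i);
            if (!node.children[ch]) {
                node.children[ch] = makeNode();
            }
            node = node.children[ch];
            node.count++;
        }
    }
}

// Number of occurrences of substr among the strings inserted into this trie.
function countOf(root, substr) {
    var node = root;
    for (var i = 0; i < substr.length; i++) {
        node = node.children[substr.charAt(i)];
        if (!node) {
            return 0;
        }
    }
    return node.count;
}

function shortestUniqueSubstrings(names) {
    var lowered = names.map(function (n) { return n.toLowerCase(); });
    var globalTrie = makeNode();
    // One local trie per name; every name also goes into the global trie.
    var localTries = lowered.map(function (name) {
        var local = makeNode();
        insertSuffixes(local, name);
        insertSuffixes(globalTrie, name);
        return local;
    });
    return lowered.map(function (name, idx) {
        // Shortest windows first, leftmost start first.
        for (var size = 1; size <= name.length; size++) {
            for (var start = 0; start + size <= name.length; start++) {
                var substr = name.substring(start, start + size);
                // Equal counts mean every occurrence lies in this name.
                if (countOf(globalTrie, substr) === countOf(localTries[idx], substr)) {
                    return substr;
                }
            }
        }
        return undefined; // every substring also occurs in another name
    });
}

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];
console.log(shortestUniqueSubstrings(names));
```

An entry is `undefined` when a name has no unique substring at all, which can only happen if the name is wholly contained in another element. Each trie can hold on the order of L*L nodes per string, so this trades memory for avoiding repeated `indexOf` scans.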

Mukesh answered Oct 13 '22 13:10