Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

method to find the shortest substring containing the given words:optimization required

Tags:

java

algorithm

I have a program that requires me to find the shortest subsegment of a given String, containing a list of words. Even though my program is correct, I am failing to deliver within a time frame of execution(5 seconds). I figured the problem is because of the complex(trivial) algorithm I am using. It consists of nested loops, and requires multiple scan of the list_of_words array.Here is my code for the search function. a[] contains the original string,stored by words, b[] contains the list of the words which is to be found to form the subsegment. String gstores the temporary subsegment formed by words from original string including the words in the list.

private static void search() // Function to find the subsegment with a required list of words
{
   int tail,x;//counters 
   String c[]=new String[b.length]; //initializing a temporary array to copy the list of words array.

   for(int i =0; i<a.length;i++)// looping throw original string array
    {
       System.arraycopy(b, 0, c, 0, b.length);//copying the list of words array to the temporary array

        for (int j=0;j<b.length;j++)//looping throw the temporary array
        { 
            x=0; //counter for temporary array

            if(a[i].equalsIgnoreCase(b[j]))//checking for a match with list of words
            {
                tail=i;
//adds the words starting from the position of the first found word(tail) till all the words from the list are found
                while((x<b.length)&&(tail<a.length))

                {
                    g=g+" "+a[tail];//adds the words in the string g

                    for(int k=0;k<c.length;k++) //checks for the found word from the temporary array and replaces it with ""    
                    {
                        if(c[k].equalsIgnoreCase(a[tail]))
                        {
                            c[k]=""; x++;
                        }
                    }
                    tail++;

                }
                if((tail==a.length)&&(x!=b.length))//checks if the string g does not contains all the listed word
                {
                    g="";
                }
                else
                    {
                    count(g);g="";//check for the shortest string.
                    }
            }
        }

    }print();
}

Sample :

Original String :This is a test. This is a programming test. a programming test this is.

Words to be found : this , test, a ,programming.

Subsegments :

This is a test This is a programming

This is a programming test

a programming test a programming test this

programming test a programming test this

test a programming test this

a programming test this

Shortest Sub segment : a programming test this

Any suggestion regarding change in the data structures or looping structures or even changes in the algorithm that optimizes the same will be helpful.

like image 418
Sushim Mukul Dutta Avatar asked May 03 '13 21:05

Sushim Mukul Dutta


2 Answers

Dynamic Programming solution:

Have a last position variable for each of the words you're looking for.

Have a total count of distinct seen words you're looking for (will never decrease, max = count of words you're looking for).

For each word position in the input:

  • If the word exists in the list of words you're looking for, update the last position of that word.
  • Increase the total count if the updated last position was not initialized.
  • If the total count is equal to the max, loop through the last positions and find the smallest one. The distance between the current position and that value will be the length of the substring. Record these values and find the best one over all positions.

An optimization is to have a heap for last positions to reduce the time taken to find the smallest one (should be used together with some structure (possibly a hash- or tree-map) that allows fast lookup of pointers into the heap given a word).

Example:

Input: This is a test. This is a programming test. a programming test this is

Looking for: this, test, a, programming

                1    2  3  4     5    6  7  8           9     10 11          12   13   14
                This is a  test. This is a  programming test. a  programming test this is
this         -1 1    1  1  1     5    5  5  5           5     5  5           5    13   13
test         -1 -1   -1 -1 4     4    4  4  4           9     9  9           12   12   12
a            -1 -1   -1 3  3     3    3  7  7           7     10 10          10   10   10
programming  -1 -1   -1 -1 -1    -1   -1 -1 8           8     8  11          11   11   11
Count        0  1    1  2  3     3    3  3  4           4     4  4           4    4    4
Substr len   NA NA   NA NA NA    NA   NA NA 5           5     6  7           8    4    5
Shortest len NA NA   NA NA NA    NA   NA NA 5           5     5  5           5    4    4

Best result: a programming test this, length = 4.

Complexity Analysis:

Let n be the number of words in the input and k the number of words we're looking for.

The algorithm only does one pass through the input and, at each step, does O(log k) work for the getMin operation (with the heap optimization).

Thus the total time taken is O(n log k).

Dealing with duplicates:

If duplicates are allowed in the words we're looking for (and the target sequence must match all occurrences), the algorithm above won't work as is, but a simple fix is to have each distinct word have its own heap of pointers into the original heap (the value in this heap being the same as the value in the original heap), with the maximum size of this heap being the number of occurrences of that word in the words we're looking for.

like image 100
Bernhard Barker Avatar answered Sep 21 '22 23:09

Bernhard Barker


Here's the implementation that occurs to me.

//Implementing here with two List<String>
//Should be easy enough to use arrays, or streams, or whatever.
public static int getShortestSubseqWith(List<String> text, List<String> words) {
    int minDistance = Integer.MAX_VALUE;
    //Create a map of the last known position of each word
    Map<String, Integer> map = new HashMap();
    for (String word : words) {
        map.put(word, -1);
    }
    String word;
    //One loop through the main search string
    for (int position = 0; position < text.size(); position++){
        word = text.get(position);
        //If the current word found is in the list we're looking for
        if (map.containsKey(word)) {
            //Update the map
            map.put(word, position);
            //And if the current positions are the closest seen so far, update the min value.
            int curDistance = getCurDistance(map);
            if (curDistance < minDistance)
                minDistance = curDistance;
        }
    }
    return minDistance;
}

//Get the current distance between the last known position of each value in the map
private static int getCurDistance(Map<String, Integer> map) {
    int min = Integer.MAX_VALUE;
    int max = 0;
    for (Integer value : map.values()) {
        if (value == -1)
            return Integer.MAX_VALUE;
        else {
            max = Math.max(max,value);
            min = Math.min(min,value);
        }
    }
    return max - min;
}

The main performance influence here, if hits are relatively sparse, and list of terms to search for relatively small, should just be the loop over the text to be searched. If hits are very frequent, performance may suffer, due to more frequent runs through getCurDistance.

like image 44
femtoRgon Avatar answered Sep 17 '22 23:09

femtoRgon