I have a program that requires me to find the shortest subsegment of a given String, containing a list of words. Even though my program is correct, I am failing to deliver within a time frame of execution(5 seconds). I figured the problem is because of the complex(trivial) algorithm I am using. It consists of nested loops, and requires multiple scan of the list_of_words array.Here is my code for the search function. a[]
contains the original string,stored by words, b[]
contains the list of the words which is to be found to form the subsegment. String g
stores the temporary subsegment formed by words from original string including the words in the list.
private static void search() // Function to find the subsegment with a required list of words
{
int tail,x;//counters
String c[]=new String[b.length]; //initializing a temporary array to copy the list of words array.
for(int i =0; i<a.length;i++)// looping throw original string array
{
System.arraycopy(b, 0, c, 0, b.length);//copying the list of words array to the temporary array
for (int j=0;j<b.length;j++)//looping throw the temporary array
{
x=0; //counter for temporary array
if(a[i].equalsIgnoreCase(b[j]))//checking for a match with list of words
{
tail=i;
//adds the words starting from the position of the first found word(tail) till all the words from the list are found
while((x<b.length)&&(tail<a.length))
{
g=g+" "+a[tail];//adds the words in the string g
for(int k=0;k<c.length;k++) //checks for the found word from the temporary array and replaces it with ""
{
if(c[k].equalsIgnoreCase(a[tail]))
{
c[k]=""; x++;
}
}
tail++;
}
if((tail==a.length)&&(x!=b.length))//checks if the string g does not contains all the listed word
{
g="";
}
else
{
count(g);g="";//check for the shortest string.
}
}
}
}print();
}
Sample :
Original String :This is a test. This is a programming test. a programming test this is.
Words to be found : this , test, a ,programming.
Subsegments :
This is a test This is a programming
This is a programming test
a programming test a programming test this
programming test a programming test this
test a programming test this
a programming test this
Shortest Sub segment : a programming test this
Any suggestion regarding change in the data structures or looping structures or even changes in the algorithm that optimizes the same will be helpful.
Dynamic Programming solution:
Have a last position variable for each of the words you're looking for.
Have a total count of distinct seen words you're looking for (will never decrease, max = count of words you're looking for).
For each word position in the input:
An optimization is to have a heap for last positions to reduce the time taken to find the smallest one (should be used together with some structure (possibly a hash- or tree-map) that allows fast lookup of pointers into the heap given a word).
Example:
Input: This is a test. This is a programming test. a programming test this is
Looking for: this, test, a, programming
1 2 3 4 5 6 7 8 9 10 11 12 13 14
This is a test. This is a programming test. a programming test this is
this -1 1 1 1 1 5 5 5 5 5 5 5 5 13 13
test -1 -1 -1 -1 4 4 4 4 4 9 9 9 12 12 12
a -1 -1 -1 3 3 3 3 7 7 7 10 10 10 10 10
programming -1 -1 -1 -1 -1 -1 -1 -1 8 8 8 11 11 11 11
Count 0 1 1 2 3 3 3 3 4 4 4 4 4 4 4
Substr len NA NA NA NA NA NA NA NA 5 5 6 7 8 4 5
Shortest len NA NA NA NA NA NA NA NA 5 5 5 5 5 4 4
Best result: a programming test this
, length = 4.
Complexity Analysis:
Let n
be the number of words in the input and k
the number of words we're looking for.
The algorithm only does one pass through the input and, at each step, does O(log k)
work for the getMin
operation (with the heap optimization).
Thus the total time taken is O(n log k)
.
Dealing with duplicates:
If duplicates are allowed in the words we're looking for (and the target sequence must match all occurrences), the algorithm above won't work as is, but a simple fix is to have each distinct word have its own heap of pointers into the original heap (the value in this heap being the same as the value in the original heap), with the maximum size of this heap being the number of occurrences of that word in the words we're looking for.
Here's the implementation that occurs to me.
//Implementing here with two List<String>
//Should be easy enough to use arrays, or streams, or whatever.
public static int getShortestSubseqWith(List<String> text, List<String> words) {
int minDistance = Integer.MAX_VALUE;
//Create a map of the last known position of each word
Map<String, Integer> map = new HashMap();
for (String word : words) {
map.put(word, -1);
}
String word;
//One loop through the main search string
for (int position = 0; position < text.size(); position++){
word = text.get(position);
//If the current word found is in the list we're looking for
if (map.containsKey(word)) {
//Update the map
map.put(word, position);
//And if the current positions are the closest seen so far, update the min value.
int curDistance = getCurDistance(map);
if (curDistance < minDistance)
minDistance = curDistance;
}
}
return minDistance;
}
//Get the current distance between the last known position of each value in the map
private static int getCurDistance(Map<String, Integer> map) {
int min = Integer.MAX_VALUE;
int max = 0;
for (Integer value : map.values()) {
if (value == -1)
return Integer.MAX_VALUE;
else {
max = Math.max(max,value);
min = Math.min(min,value);
}
}
return max - min;
}
The main performance influence here, if hits are relatively sparse, and list of terms to search for relatively small, should just be the loop over the text
to be searched. If hits are very frequent, performance may suffer, due to more frequent runs through getCurDistance
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With