
Longest substring that matches a string in an array

Assume I have the following input and that my implementation language is Java:

  • An array, A, with the contents: ["brown fox", "jumped over the", "lazy dog", "dog", "the", "fish", "quantum burrito", "ox jumped over the laz", "and ate", "ate pie"]

  • A string, S, with the contents: "the quick brown fox jumped over the lazy dog and ate pie" (first character index 0, last character index 55)

I need to (as efficiently as is practical on a typical computer) assemble a list of the substrings of S that exactly match an element of the array A, sorted in descending order of length. I also need to know the starting and ending character index within S of each match, subject to some constraints.

The following constraints and peculiarities apply to this problem:

  • Not all elements in the array A may be contained within the string S (in the example, "fish" and "quantum burrito" are not in S).
  • The string S may contain lengths of characters that don't match any elements within the array (in the example, "quick" in S does not match anything in A).
  • Respect word boundaries within S (words are guaranteed to be delimited by exactly one space in both A and S); meaning, it is not a match if a length of characters within S matches A but violates word boundaries by not capturing one or more whole words.
  • When there is a tie on length, sort order in the result array is irrelevant.
  • Once a range of characters within S is matched, that range will only be captured in one result element, even if it could match multiple elements within A.
  • If there are two possible matches, pick one arbitrarily depending on the order the algorithm processes the elements in the array.
  • I need to keep track of which ranges of characters don't get matched after the algorithm is done.

Working this out manually just by looking at the string and array, in this example, the solution would be the following, given in the correct descending order (zero-based indexing):

  1. The range of characters [20..34] ("jumped over the") is in index 1 of the array. Length = 15
  2. The range of characters [10..18] ("brown fox") is in index 0 of the array. Length = 9
  3. The range of characters [36..43] ("lazy dog") is in index 2 of the array. Length = 8
  4. The range of characters [49..55] ("ate pie") is in index 9 of the array (arbitrary match; matching "and ate" is equally valid, but we don't match both because "ate" is already "consumed"; no pun intended). Length = 7
  5. The range of characters [0..2] ("the") is in index 4 of the array. Length = 3
  6. The word "quick" was not matched to any element in the array.
  7. The word "and" was not matched to any element in the array.

Note, specifically, that "ox jumped over the laz", although it is the longest substring in A that is within S, is not matched in the result set because it violates the word boundaries of "fox" and "lazy".

Question: Am I describing a fairly standard algorithm that may exist in a library (in part or in whole; I am willing to build this out of simpler primitive building blocks) or is this something so custom that I need to implement it from scratch?

If I implement it from scratch, I think I need to take an approach broadly sketched out like the following:

  • Split string S on word boundaries
  • Construct a list L of all (order-respecting) word sequences within the string S in descending-length order (for example: ["the quick brown fox jumped over the lazy dog and ate pie", "the quick brown fox jumped over the lazy dog and ate", "quick brown fox jumped over the lazy dog and ate pie", ... "the quick brown fox jumped", ... "brown fox jumped", ... "jumped", "quick", "brown", ... "pie"]).
  • Construct a suffix tree T from the array A's contents
  • Iterate over the list L in order, and try to find each element in T
  • Once an element is found, note down the substring range from S, the match index from A, then continue iterating
  • Each time an element is matched, if its character range in S overlaps a range that has already been matched, skip it and keep going

Sounds slow... And probably moderately difficult to do right.
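
For what it's worth, the outline above can be sketched fairly compactly. This is a minimal sketch under a few assumptions, not a tuned implementation: it uses a plain HashMap lookup in place of the suffix tree (only exact whole-phrase matches are needed), it enumerates the word windows of S directly instead of materializing the list L, and the names WordWindowMatcher and Match are made up for illustration. Ties on length fall back to left-to-right discovery order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordWindowMatcher {
    /** A match: inclusive character range in S plus the index of the matching element of A. */
    public static final class Match {
        public final int start, end, indexInA;
        Match(int start, int end, int indexInA) {
            this.start = start; this.end = end; this.indexInA = indexInA;
        }
        public int length() { return end - start + 1; }
    }

    public static List<Match> match(String s, String[] a) {
        // Exact-phrase lookup; a hash map stands in for the suffix tree here.
        Map<String, Integer> phraseIndex = new HashMap<>();
        for (int i = 0; i < a.length; i++) {
            phraseIndex.putIfAbsent(a[i], i);
        }

        // Split S into words and record the character offset where each word starts.
        String[] words = s.split(" ");
        int[] wordStart = new int[words.length];
        for (int i = 0, pos = 0; i < words.length; i++) {
            wordStart[i] = pos;
            pos += words[i].length() + 1; // +1 for the single separating space
        }

        // Every run of whole words that equals an element of A is a candidate,
        // stored as {firstWord, lastWord, indexInA, lengthInChars}.
        List<int[]> candidates = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            for (int j = i; j < words.length; j++) {
                int endExclusive = wordStart[j] + words[j].length();
                Integer idx = phraseIndex.get(s.substring(wordStart[i], endExclusive));
                if (idx != null) {
                    candidates.add(new int[] {i, j, idx, endExclusive - wordStart[i]});
                }
            }
        }
        // Longest first; List.sort is stable, so ties keep discovery order.
        candidates.sort((x, y) -> Integer.compare(y[3], x[3]));

        // Greedily accept candidates whose words have not been claimed yet.
        boolean[] used = new boolean[words.length];
        List<Match> result = new ArrayList<>();
        for (int[] c : candidates) {
            boolean free = true;
            for (int w = c[0]; w <= c[1]; w++) free &= !used[w];
            if (!free) continue;
            for (int w = c[0]; w <= c[1]; w++) used[w] = true;
            int start = wordStart[c[0]];
            result.add(new Match(start, start + c[3] - 1, c[2]));
        }
        return result;
    }
}
```

On the example input this should yield five matches with lengths 15, 9, 8, 7 and 3. Because of the tie rule it picks "and ate" rather than "ate pie", which the constraints allow. Enumerating windows is quadratic in the number of words in S, so for long inputs, capping the window at the longest phrase in A would be an obvious refinement.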

allquixotic asked Jul 13 '16 22:07




1 Answer

You can do this with regular expressions alone. The following is only a demonstration and does not satisfy the full list of requirements (namely, collecting the results in an array and ordering them), but those parts are straightforward to add.

The "tricky" part is the word-boundary anchor \b and using a group () to capture the text you actually want to match.

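import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) {
        String[] A = {"brown fox", "jumped over the", "lazy dog", "dog", "the", "fish", "quantum burrito", "ox jumped over the laz", "and ate", "ate pie"};
        String S = "the quick brown fox jumped over the lazy dog and ate pie";

        for (String s : A) {
            // Quote the phrase so regex metacharacters in it are treated literally;
            // \b on both sides rejects hits that start or end mid-word.
            Pattern p = Pattern.compile("\\b(" + Pattern.quote(s) + ")\\b");
            Matcher m = p.matcher(S);

            // find() visits every occurrence; no leading or trailing .* is needed.
            while (m.find()) {
                System.out.println("Match => " + s);
                System.out.println("    Start index: " + m.start(1));
                // end(1) is exclusive, so subtract 1 for an inclusive index.
                System.out.println("    End index: " + (m.end(1) - 1));
                System.out.println("    Length: " + m.group(1).length());
            }
        }
    }
}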

The above matches all contained strings as long as they are space delimited and outputs their start/end position within the main string.
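
To cover the remaining requirements (ordering, one match per range, tracking what was left over), the regex hits can be collected and filtered afterwards. The following is a rough sketch rather than a definitive implementation. It assumes space-only delimiters, so it swaps \b for the lookarounds (?<!\S) and (?!\S), and it uses a BitSet over the character positions of S to record claimed ranges. The class and method names (RegexPhraseMatcher, match, unmatchedWords) are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPhraseMatcher {
    /**
     * Returns accepted hits as {start, endInclusive, indexInA}, longest first,
     * and marks their character positions in the caller-supplied BitSet.
     */
    public static List<int[]> match(String s, String[] a, BitSet claimed) {
        List<int[]> hits = new ArrayList<>();
        for (int i = 0; i < a.length; i++) {
            // The lookarounds demand a space or a string edge on both sides of the phrase.
            Pattern p = Pattern.compile("(?<!\\S)" + Pattern.quote(a[i]) + "(?!\\S)");
            Matcher m = p.matcher(s);
            while (m.find()) {
                hits.add(new int[] {m.start(), m.end() - 1, i});
            }
        }
        // Longest first; List.sort is stable, so ties keep discovery order.
        hits.sort((x, y) -> Integer.compare(y[1] - y[0], x[1] - x[0]));

        List<int[]> accepted = new ArrayList<>();
        for (int[] h : hits) {
            // nextSetBit gives the first claimed position at or after h[0], or -1 if none.
            int next = claimed.nextSetBit(h[0]);
            if (next == -1 || next > h[1]) {
                claimed.set(h[0], h[1] + 1); // the upper bound of set() is exclusive
                accepted.add(h);
            }
        }
        return accepted;
    }

    /** Returns the words of s that no accepted hit covers, as {start, endInclusive}. */
    public static List<int[]> unmatchedWords(String s, BitSet claimed) {
        List<int[]> gaps = new ArrayList<>();
        int start = 0;
        while (start < s.length()) {
            int end = s.indexOf(' ', start);
            if (end == -1) end = s.length();
            // Hits cover whole words, so checking the first character is enough.
            if (!claimed.get(start)) gaps.add(new int[] {start, end - 1});
            start = end + 1;
        }
        return gaps;
    }
}
```

Calling match(S, A, claimed) with a fresh BitSet and then unmatchedWords(S, claimed) should, on the example input, accept five hits and report two leftover words. With this tie order the leftovers are "quick" and "pie", because "and ate" is discovered before "ate pie".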

Frankie answered Oct 14 '22 23:10