Assume I have the following input and that my implementation language is Java:
An array, A, with the contents: ["brown fox", "jumped over the", "lazy dog", "dog", "the", "fish", "quantum burrito", "ox jumped over the laz", "and ate", "ate pie"]
A string, S, with the contents: "the quick brown fox jumped over the lazy dog and ate pie"
(first character index 0, last character index 55)
I need to (as efficiently as is practical on a typical computer) assemble a list of substrings of the string S that are contained (entirely) within an element of the array A, and get those in descending order. I also need to know the starting and ending character index within the string S of each match. ... But with some constraints.
The following constraints and peculiarities apply to this problem:
Working this out manually just by looking at the string and array, in this example, the solution would be the following, given in the correct descending order (zero-based indexing):
Note, specifically, that "ox jumped over the laz", although it is the longest substring in A that is within S, is not matched in the result set because it violates the word boundaries of "fox" and "lazy".
Question: Am I describing a fairly standard algorithm that may exist in a library (in part or in whole; I am willing to build this out of simpler primitive building blocks) or is this something so custom that I need to implement it from scratch?
If I implement it from scratch, I think I need to take an approach broadly sketched out like the following:
["the quick brown fox jumped over the lazy dog and ate pie", "the quick brown fox jumped over the lazy dog and ate", "quick brown fox jumped over the lazy dog and ate pie", ... "the quick brown fox jumped", ... "brown fox jumped", ... "jumped", "quick", "brown", ... "pie"]
Sounds slow... and probably moderately difficult to do right.
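The approach sketched above (enumerate every run of whole words in S and check each one against A) can be written fairly compactly if A is loaded into a HashSet first. What follows is a minimal sketch, not a complete answer. The names WordSpanMatcher, Span and findSpans are made up for illustration. It assumes words are separated by single spaces, assumes "descending order" means longest match first, and reports overlapping matches rather than applying any further constraints.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordSpanMatcher {

    /** One match: the matched text plus its inclusive start/end indices in S. */
    public static final class Span {
        public final String text;
        public final int start;
        public final int end; // inclusive

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "\"" + text + "\" [" + start + ", " + end + "]";
        }
    }

    public static List<Span> findSpans(String s, String[] a) {
        Set<String> dictionary = new HashSet<>(Arrays.asList(a));

        // Record the [start, endExclusive) offsets of every word in s.
        List<int[]> words = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            while (i < s.length() && s.charAt(i) == ' ') i++;
            int wordStart = i;
            while (i < s.length() && s.charAt(i) != ' ') i++;
            if (i > wordStart) words.add(new int[] {wordStart, i});
        }

        // Try every run of consecutive words; keep the runs that appear in A.
        List<Span> result = new ArrayList<>();
        for (int first = 0; first < words.size(); first++) {
            for (int last = first; last < words.size(); last++) {
                int from = words.get(first)[0];
                int to = words.get(last)[1];
                String candidate = s.substring(from, to);
                if (dictionary.contains(candidate)) {
                    result.add(new Span(candidate, from, to - 1));
                }
            }
        }

        // Assumption: "descending" means longest first; ties go to the earlier start.
        result.sort(Comparator.comparingInt((Span sp) -> sp.text.length()).reversed()
                .thenComparingInt(sp -> sp.start));
        return result;
    }

    public static void main(String[] args) {
        String[] a = {"brown fox", "jumped over the", "lazy dog", "dog", "the", "fish",
                "quantum burrito", "ox jumped over the laz", "and ate", "ate pie"};
        String s = "the quick brown fox jumped over the lazy dog and ate pie";
        for (Span span : findSpans(s, a)) {
            System.out.println(span);
        }
    }
}
```

Because candidates are only ever built from whole words, "ox jumped over the laz" can never be produced and so is never reported. The nested loop visits every word run in S, which is quadratic in the number of words; capping the run length at the largest word count found in A would be one way to trim that.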
To extract a substring as an array of characters in Java, use the getChars() method. Let's say the following is our string and character array:
String str = "World is not enough!";
char[] chArr = new char[10];
Now, use the getChars() method to extract a substring.
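As a small illustration of that call, here is a sketch using the same string and a 10-character buffer. The range 0 to 5 is picked purely for the example (the text above does not say which range it extracts), and GetCharsDemo and copyRange are hypothetical names.

```java
public class GetCharsDemo {

    // Copies s[begin, end) into a 10-char buffer and returns the copied part.
    // The range must be no longer than 10 characters, or getChars() throws.
    static String copyRange(String str, int begin, int end) {
        char[] chArr = new char[10];
        // getChars(srcBegin, srcEnd, dst, dstBegin): srcEnd is exclusive.
        str.getChars(begin, end, chArr, 0);
        return new String(chArr, 0, end - begin);
    }

    public static void main(String[] args) {
        String str = "World is not enough!";
        System.out.println(copyRange(str, 0, 5)); // prints "World"
    }
}
```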
You can do this with regular expressions alone. The following is only a demonstration and does not cover every requirement (namely putting the results in an array and ordering them), but those parts are straightforward to add.
The "tricky" parts are the word-boundary delimiter \b and using groups () to capture the part you actually want to match.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String[] A = {"brown fox", "jumped over the", "lazy dog", "dog", "the", "fish", "quantum burrito", "ox jumped over the laz", "and ate", "ate pie"};
String S = "the quick brown fox jumped over the lazy dog and ate pie";
for (String s : A) {
    // Pattern.quote() treats s as literal text; \b pins both ends to word boundaries.
    // No leading/trailing ".*", so find() reports every occurrence, not just the last.
    Pattern p = Pattern.compile("\\b(" + Pattern.quote(s) + ")\\b");
    Matcher m = p.matcher(S);
    while (m.find()) {
        System.out.println("Match => " + s);
        System.out.println(" Start index: " + m.start(1));
        System.out.println(" End index: " + (m.end(1) - 1)); // m.end() is exclusive
        System.out.println(" Length: " + m.group(1).length());
    }
}
The above matches every element of A that appears in S between word boundaries and prints the start and end position of each match within S.
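Collecting the matches and ordering them could look roughly like the sketch below. RegexMatchCollector and collect are hypothetical names, each match is stored as an {start, end} pair with an inclusive end, and "descending" is assumed to mean longest match first. Nothing here filters out overlapping matches.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatchCollector {

    /** Returns {start, endInclusive} pairs for every word-bounded match of an element of a in s. */
    public static List<int[]> collect(String s, String[] a) {
        List<int[]> matches = new ArrayList<>();
        for (String element : a) {
            // Quote the element so regex metacharacters in it are matched literally.
            Pattern p = Pattern.compile("\\b" + Pattern.quote(element) + "\\b");
            Matcher m = p.matcher(s);
            while (m.find()) {
                matches.add(new int[] {m.start(), m.end() - 1});
            }
        }
        // Assumption: longest match first, ties broken by the earlier start index.
        matches.sort(Comparator.comparingInt((int[] r) -> r[1] - r[0]).reversed()
                .thenComparingInt(r -> r[0]));
        return matches;
    }

    public static void main(String[] args) {
        String[] a = {"brown fox", "jumped over the", "lazy dog", "dog", "the", "fish",
                "quantum burrito", "ox jumped over the laz", "and ate", "ate pie"};
        String s = "the quick brown fox jumped over the lazy dog and ate pie";
        for (int[] r : collect(s, a)) {
            System.out.println(s.substring(r[0], r[1] + 1) + " [" + r[0] + ", " + r[1] + "]");
        }
    }
}
```

This compiles one pattern per element of A and scans S once for each, which is fine for a small array like the one here but grows with the size of A.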