Finding if a string contains any string in a collection

Question

I'm trying to improve the performance of a Java-function I have that is determining whether a given search string contains >0 of the strings in a collection. This could look like premature optimization but the function is called A LOT so any speed up would be very beneficial.

The code currently looks like this:

public static boolean containsAny(String searchString, List<String> searchCollection) {
    int size = searchCollection.size();
    for (int i = 0; i < size; i++) {
        String stringInCollection = searchCollection.get(i);
        if (!Util.isNullOrEmpty(stringInCollection)) {
            // This is a performance optimization of contains.
            if (searchString.indexOf(stringInCollection, 0) > -1) {
                return true;
            }
        }
    }
    return false;
}

The list typically has about 30 elements and the same collection is reused a lot between each call.

The code above is a pretty straightforward linear search. I don't think it can be significantly improved unless we change the data structure to make it better than O(n). Are there any data structures that would enable me to do that?

Joop Eggen · Accepted Answer

// Make a regex pattern (once only):
StringBuilder pattern = new StringBuilder();
for (String sought : searchCollection) {
    if (!Util.isNullOrEmpty(sought)) {
        if (pattern.length() != 0) {
            pattern.append('|');
        }
        pattern.append(Pattern.quote(sought));
    }
}
final Pattern PATTERN = Pattern.compile("(" + pattern + ")");

This creates a pattern of alternatives like "(abc|def|ghi)". You might consider a case-insensitive search.

And in the function containsAny:

Matcher m = PATTERN.matcher(searchString);
return m.find();

Regex compilation is relatively smart. It would be comparable to using a search tree of your collection of sought words: "agent" and "agitator" to ("ag", ("ent", "itator"))

kraskevich · Answer

It is possible to speed it up significantly with Aho-Corasick algorithm.

You can build an Aho-Corasick automaton for a collection using O(total length of all strings in a collections) time and space. Then it will be possible to check if one of the strings in a collection is a substring of a given string S in O(S.length) time by traversing this automaton.

Gladwin Burboz · Answer

This is a CPU intensive operation and not long running or blocked on I/O. If you are using Java 8 you can use parallel streams to do processing in parallel as shown below. The method has been changed to use Collection instead of List to keep it more flexible.

public static boolean containsAny(final String searchString,
        final Collection<String> searchCollection) {
    return searchCollection.stream().parallel()
            .anyMatch(x -> searchString.indexOf(x) > -1);
}

Furthermore, instead of using List, a Set should be used as the underlying data structure so that duplicate entries, if any, will be eliminated.

ethanfar · Answer

I believe the best suited data structure for this is a Suffix Tree. For a string of size n, building the tree takes Theta(n), and searching a sub-string of length m in it, takes O(m).

This is one of those data structures that are extremely well suited (and intended) for searching strings. It's a very common data structure with many implementations online.

Finding if a string contains any string in a collection

Tags:

java

string

collections

data-structures

Yrlec

4 Answers

Joop Eggen

kraskevich

Gladwin Burboz

ethanfar

Recent Activity

Donate For Us

Finding if a string contains any string in a collection

Tags:

java

string

collections

data-structures

Yrlec

4 Answers

Joop Eggen

kraskevich

Gladwin Burboz

ethanfar

Related questions

Recent Activity

Donate For Us