Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a more efficient way to assess containment of strings?

I have to execute this line of cose several million times, I wonder if there is a way to optimize it (maybe precomputing something?).

a.contains(b) || b.contains(a)

Thank you

edit: the code executed by the contains method already checks for a.length < b.length.

public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) {
    byte first = str[0];
    int max = (valueCount - strCount);
    for (int i = fromIndex; i <= max; i++) {
        [...]
    }
    return -1;
}
like image 459
MauroT Avatar asked Jun 18 '19 07:06

MauroT


People also ask

How do you check if all elements of a string are same?

To find whether a string has all the same characters. Traverse the whole string from index 1 and check whether that character matches the first character of the string or not. If yes, then match until string size. If no, then break the loop.

How do you check if a string contains only one character in Python?

Using Python's "in" operator The simplest and fastest way to check whether a string contains a substring or not in Python is the "in" operator . This operator returns true if the string contains the characters, otherwise, it returns false .

Is Strcmp fast?

While comparing the two function, in a loop repeted 500'000'000 times, strcmp execute too much fast, about x750 times faster. The execution time of that function is 3.058s , while strcmp only 0.004s .


1 Answers

As I understand the task, you have to check whether a contains b or vice versa for each pair of a and b from a set of about 35 million words. That's a lot of pairs to check.

You should be able to narrow the search down considerable by precomputing which n-grams a word contains: If a contains some n-gram, then b has to contain the same n-gram if b contains a. You could e.g. precompute all the trigrams that each word in the list contains, and at the same time all the words that contain a given trigram, then you can just look up the words in those dictionaries and with some set operations get a small set of candidates to check properly.

In pseudo-code:

  • select a size for the n-grams (see below)
  • initialize a Map<String, Set<String>> ngram_to_word
  • first iteration: for each word a in your data set
    • iterate all the n-grams (e.g. using some sort of sliding window) of a
    • for each, add a to the sets of words containing those n-grams in ngrams_to_words
  • second iteration: for each word a in your data set
    • again get all the n-grams a contains
    • for each of those, get the set of words that contain that n-gram from ngrams_to_words
    • get the intersection of those sets of words
    • for each word b in that intersection that contains all the n-grams that a contains (but maybe in a different order or quantity), properly check whether b contains a

Depending on the number of letters in those n-grams (e.g. bigrams, trigrams, ...), they will be more expensive to pre-compute, in both time and space, but the effect will also be greater. In the simplest case, you could even just precompute which words contain a given letter (i.e. "1-grams"); that should be fast and already considerable narrow down the words to check. Of course, the n-grams should not be shorter than the shortest of the words in the data set, but you could even use two length of n-grams, e.g. use two maps letter_to_words and trigrams_to_words.

like image 102
tobias_k Avatar answered Nov 15 '22 05:11

tobias_k