I have to execute this line of cose several million times, I wonder if there is a way to optimize it (maybe precomputing something?). <code>a.contains(b) || b.contains(a)</code> Thank you edit: the code executed by the contains method already checks for a.length < b.length. <pre class="prettyprint"><code>public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) { byte first = str[0]; int max = (valueCount - strCount); for (int i = fromIndex; i <= max; i++) { [...] } return -1; } </code></pre>

As I understand the task, you have to check whether <code>a</code> contains <code>b</code> or vice versa for each pair of <code>a</code> and <code>b</code> from a set of about 35 million words. That's a lot of pairs to check. You should be able to narrow the search down considerable by precomputing which n-grams a word contains: If <code>a</code> contains some n-gram, then <code>b</code> has to contain the same n-gram if <code>b</code> contains <code>a</code>. You could e.g. precompute all the trigrams that each word in the list contains, and at the same time all the words that contain a given trigram, then you can just look up the words in those dictionaries and with some set operations get a small set of candidates to check properly. In pseudo-code: <ul> <li>select a size for the n-grams (see below)</li> <li>initialize a <code>Map<String, Set<String>> ngram_to_word</code> </li> <li>first iteration: for each word <code>a</code> in your data set <ul> <li>iterate all the n-grams (e.g. using some sort of sliding window) of <code>a</code> </li> <li>for each, add <code>a</code> to the sets of words containing those n-grams in <code>ngrams_to_words</code> </li> </ul> </li> <li>second iteration: for each word <code>a</code> in your data set <ul> <li>again get all the n-grams <code>a</code> contains</li> <li>for each of those, get the set of words that contain that n-gram from <code>ngrams_to_words</code> </li> <li>get the intersection of those sets of words</li> <li>for each word <code>b</code> in that intersection that contains all the n-grams that <code>a</code> contains (but maybe in a different order or quantity), properly check whether <code>b</code> contains <code>a</code> </li> </ul> </li> </ul> Depending on the number of letters in those n-grams (e.g. bigrams, trigrams, ...), they will be more expensive to pre-compute, in both time and space, but the effect will also be greater. In the simplest case, you could even just precompute which words contain a given letter (i.e. "1-grams"); that should be fast and already considerable narrow down the words to check. Of course, the n-grams should not be shorter than the shortest of the words in the data set, but you could even use two length of n-grams, e.g. use two maps <code>letter_to_words</code> and <code>trigrams_to_words</code>.

Is there a more efficient way to assess containment of strings?

Q: How do you check if all elements of a string are same?

To find whether a string has all the same characters. Traverse the whole string from index 1 and check whether that character matches the first character of the string or not. If yes, then match until string size. If no, then break the loop.

Q: How do you check if a string contains only one character in Python?

Using Python's "in" operator The simplest and fastest way to check whether a string contains a substring or not in Python is the "in" operator . This operator returns true if the string contains the characters, otherwise, it returns false .

Q: Is Strcmp fast?

While comparing the two function, in a loop repeted 500'000'000 times, strcmp execute too much fast, about x750 times faster. The execution time of that function is 3.058s , while strcmp only 0.004s .

Tags:

java

performance

math

I have to execute this line of cose several million times, I wonder if there is a way to optimize it (maybe precomputing something?).

a.contains(b) || b.contains(a)

Thank you

edit: the code executed by the contains method already checks for a.length < b.length.

public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) {
    byte first = str[0];
    int max = (valueCount - strCount);
    for (int i = fromIndex; i <= max; i++) {
        [...]
    }
    return -1;
}

459

asked Jun 18 '19 07:06

MauroT

1 Answers

As I understand the task, you have to check whether a contains b or vice versa for each pair of a and b from a set of about 35 million words. That's a lot of pairs to check.

You should be able to narrow the search down considerable by precomputing which n-grams a word contains: If a contains some n-gram, then b has to contain the same n-gram if b contains a. You could e.g. precompute all the trigrams that each word in the list contains, and at the same time all the words that contain a given trigram, then you can just look up the words in those dictionaries and with some set operations get a small set of candidates to check properly.

In pseudo-code:

select a size for the n-grams (see below)
initialize a Map<String, Set<String>> ngram_to_word
first iteration: for each word a in your data set
- iterate all the n-grams (e.g. using some sort of sliding window) of a
- for each, add a to the sets of words containing those n-grams in ngrams_to_words
second iteration: for each word a in your data set
- again get all the n-grams a contains
- for each of those, get the set of words that contain that n-gram from ngrams_to_words
- get the intersection of those sets of words
- for each word b in that intersection that contains all the n-grams that a contains (but maybe in a different order or quantity), properly check whether b contains a

Depending on the number of letters in those n-grams (e.g. bigrams, trigrams, ...), they will be more expensive to pre-compute, in both time and space, but the effect will also be greater. In the simplest case, you could even just precompute which words contain a given letter (i.e. "1-grams"); that should be fast and already considerable narrow down the words to check. Of course, the n-grams should not be shorter than the shortest of the words in the data set, but you could even use two length of n-grams, e.g. use two maps letter_to_words and trigrams_to_words.

102

answered Nov 15 '22 05:11

tobias_k

Related questions
                            
                                Android Studio Room Persistence - Should each entity have a seperate DAO, and how does that effect the repository class?
                            
                                Why elastic-search container memory usage keeps increasing with little use?
                            
                                spring cloud aws multiple sqs listener
                            
                                How to memoize recursive path of length n search
                            
                                JavaFX and Scene Builder clip scene edges despite specifying USE_COMPUTED_SIZE
                            
                                Generating All Combinations of List n Levels Deep in Java
                            
                                Why does this hexadecimal value gets different decimal value? [closed]
                            
                                Type mismatch: cannot convert from Optional<Object> to BasketDTO
                            
                                ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path?
                            
                                What's the meaning of "CGC" and "CGCT" in JDK11 "jstat -gc <PID>"?
                            
                                How to separate Business logic and metrics log?
                            
                                How to fix 'handleWindowVisibility: no activity for token android.os.BinderProxy'?
                            
                                Is it possible that converting inner join select result to nested list JSON?
                            
                                Project Reactor, using a Flux sink outside of the creation lambda
                            
                                Byte Buddy - how can make a field self type?
                            
                                CPU gets pegged - Problem with java.util.zip.ZStreamRef
                            
                                android custom user interface
                            
                                Regex to match multiple occurances IFF another string occurs
                            
                                kubernetes failed to start pod due to ContainerCannotRun
                            
                                Hitting low Java thread limit (503 threads)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With