We have two sets, A and B. Each one of these sets include strings. eg.: A - {"abwcd", "dwas", "www"} and B - {"opqr", "tops", "ibmd"} How can I count the subsequences that appear in all strings from set A, but in none of the strings in set B? For the example above the answer is 1 (the subsequence "w").
All this in an optimal way. I thought about using two tries, first time I put all the subsequences of all the strings in B in trie t_B and then, I start putting all the subsequences of all the strings in A in the trie t_A, without updating the trie if the same subsequence was found before in the same string (e.g.: if I have the string "aba", I don't count the subsequence "a" twice). In this way, if I find a subsequence that has n (size of A) appearances in t_A, I check if it's in t_B, and if it's not, I count it. But this is very very slow, if A and B have size 15 and the strings are about 100 characters long, my programs runs in over 1 second.
EDIT: Since any subsqeunce ends in the last character of the string or in a character before it, we don't have to generat all the subsequences, but the ones that end in the last character of the string. When I push them into the trie, I note every node with 1. So if I have the string "abcd", I only push "abcd", "bcd", "cd" and "d", since this should be the 'skeleton' of the trie. But this is not a very big optimization, I'm still looking for something better.
You shouldn't have to put all the subsequences of all the strings in A into the trie. Only put in the valid ones. Test if a sequence is valid before adding it. I'm assuming a membership test is faster than adding a new item. A smaller trie should fail membership tests faster, so this strategy is designed to trim down the trie as fast as possible.
Specifically: Put all the subsequences from the first string in A into the trie. (For efficiency, use the shortest string as the first). Keep a set of references to all the leaf nodes. Next, for all the strings in B, test each subsequence to see if it exists in A. If it does, remove that sequence and it's reference. (Start with the longest string in B to pare the trie as fast as possible).
Now you have the minimum set of possibilities to test against. For all the remaining strings in A, test each subsequence to see if it exists in the trie. If it does, mark the node as valid, else move to the next subsequence. After each string, remove all the invalid nodes from the trie, and reset the flags on the rest to invalid.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With