I need a data structure representing a set of sequences (all of the same, known, length) with the following non-standard operation:
Find two sequences in the set that differ at exactly one index. (Or establish that no such pair exists.)
If N
is the length of the sequences and M
the number of sequences, there is an obvious O(N*M*M)
algorithm. I wonder if there is a standard way of solving this more efficiently. I'm willing to apply pre-processing if needed.
Bonus points if instead of returning a pair, the algorithm returns all sequences that differ at the same point.
Alternatively, I am also interested in a solution where I can check efficiently whether a particular sequence differs at one index with from any sequence in the set. If it helps, we can assume that in the set, no two sequences have that property.
Edit: you can assume N
to be reasonably small. By this, I mean improvements such as O(log(N)*M*M)
are not immediately useful for my use case.
For each sequence and each position i in that sequence, calculate a hash of the sequence without position i and add it to a hash table. If there is already an entry in the table, you have found a potential pair that differs only in one position. Using rolling hashes from both start and end and combining them, you can calculate each hash in constant time. The total running time is expected O(N*M).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With