Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding all pairs of sequences that differ at exactly one position

I need a data structure representing a set of sequences (all of the same, known, length) with the following non-standard operation:

Find two sequences in the set that differ at exactly one index. (Or establish that no such pair exists.)

If N is the length of the sequences and M the number of sequences, there is an obvious O(N*M*M) algorithm. I wonder if there is a standard way of solving this more efficiently. I'm willing to apply pre-processing if needed.

Bonus points if instead of returning a pair, the algorithm returns all sequences that differ at the same point.

Alternatively, I am also interested in a solution where I can check efficiently whether a particular sequence differs at one index with from any sequence in the set. If it helps, we can assume that in the set, no two sequences have that property.

Edit: you can assume N to be reasonably small. By this, I mean improvements such as O(log(N)*M*M) are not immediately useful for my use case.

like image 532
Philippe Avatar asked May 02 '13 15:05

Philippe


1 Answers

For each sequence and each position i in that sequence, calculate a hash of the sequence without position i and add it to a hash table. If there is already an entry in the table, you have found a potential pair that differs only in one position. Using rolling hashes from both start and end and combining them, you can calculate each hash in constant time. The total running time is expected O(N*M).

like image 183
Falk Hüffner Avatar answered Oct 27 '22 15:10

Falk Hüffner