Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast algorithms for finding unique sets in two very long sequences of text

I need to compare the DNA sequences of X and Y chromosomes, and find patterns (composed of around 50-75 base pairs) that are unique to the Y chromosome. Note that these sequence parts can repeat in the chromosome. This needs to be done quickly (BLAST takes 47 days, need a few hours or less). Are there any algorithms or programs in particular suited to this kind of comparison? Again, speed is the key here.

One of the reasons I put this on SO was to get perspective from people outside the specific application domain, who can put forth algorithms they use in string comparison in their daily use, that may apply to our use. So don't be shy!

like image 892
person Avatar asked Aug 27 '10 02:08

person


2 Answers

  1. Build a suffix tree S on sequence X.
  2. For each starting position i in sequence Y, look for the string Y[i..i+75] in S. If no match can be found starting at position i (i.e. if lookup fails after j < 75 nucleotides matched) then you have found a length-j string unique to Y.
  3. The smallest such string over all starting positions i is the shortest unique string (or just halt after you find any such string if you aren't interested in minimising the length).

Total time: O(|X| + m|Y|) where m is the maximum string length (e.g. m = 75).

There are probably even more efficient algorithms based on generalised suffix trees.

like image 79
j_random_hacker Avatar answered Sep 28 '22 02:09

j_random_hacker


I am assuming that you have a single X and a single Y to compare. Concatenate them, separated by a marker character that does not appear in either, to form e.g. XoY. Now form the http://en.wikipedia.org/wiki/Suffix_array in linear time.

What you get is an array of pointers to positions in the concatenated string, where the pointers are arranged so that the substrings they point to appear in alphabetical order in the array. You also get an LCP array, giving the length of the longest common prefix shared between a suffix and the suffix directly before it in the array, which is the suffix that sorts just less than it. This is in fact the longest common prefix shared between that position and ANY substring less than it, because anything with a longer common prefix and less than string[i] would sort between it and the current string[i - 1].

You can tell which original string a pointer points into from its position, because X comes before Y. You can cut the array up into alternating sub-sections of X and Y pointers. The length of the common prefix shared between pos[i] and pos[i - 1] is lcp[i]. The length of the prefix shared between pos[i] and pos[i-2] is min(lcp[i], lcp[i-1]). So if you start at the Y value just before a range of Xs you can work out the number of characters of prefix between that Y and all of the Xs in turn by stepping down the section, doing a min operation at each step. Similarly you can work out the number of characters of prefix shared between all of those Xs and the Y that appears next in the suffix array at the cost of one min per X. Ditto, of course for the Y ranges. Now do a max per entry to work out the longest prefix shared between each position in X (or Y) and any position in Y (or X).

I think you want the substrings within either X or Y which have small prefixes shared between it and any other substring of the other sex, because the strings one character longer than this starting from the same position do not appear in the other sex at all. I think once you have done the min() calculations above you can extract the N smallest prefix substrings using a heap to keep track of the N smallest entries. I think everything here takes time linear in |X| + |Y| (unless N is comparable to |X| or |Y|).

like image 24
mcdowella Avatar answered Sep 28 '22 01:09

mcdowella