I am looking for a fast algorithm for search purpose in a huge string (it's a organism genome sequence composed of hundreds of millions to billions of chars).
There are only 4 chars {A,C,G,T} present in this string, and "A" can only pair with "T" while "C" pairs with "G".
Now I am searching for two substrings (with length constraint of both substring between {minLen, maxLen}, and interval length between {intervalMinLen, intervalMaxLen}) that can pair with one another antiparallely.
For example, The string is: ATCAG GACCA TACGC CTGAT
Constraints: minLen = 4, maxLen = 5, intervalMinLen = 9, intervalMaxLen = 10
The result should be
"ATCAG" pair with "CTGAT"
"TCAG" pair with "CTGA"
Thanks in advance.
Update: I already have the method to determine whether two string can pair with one another. The only concern is doing exhaustive search is very time consuming.
The Aho-Corasick string searching algorithm simultaneously finds all occurrences of multiple patterns in one pass through the text. On the other hand, the Boyer-Moore algorithm is understood to be the fastest algorithm for a single pattern.
The so-called naive or brute force algorithm is the most intuitive approach to the string pattern-matching problem.
Pattern-matching algorithms scan the text with the help of a window, whose size is equal to the length of the pattern. The first step is to align the left ends of the window and the text and then compare the corresponding characters of the window and the pattern; this procedure is known as attempt.
Overview. The most basic case of string searching involves one (often very long) string, sometimes called the haystack, and one (often very short) string, sometimes called the needle. The goal is to find one or more occurrences of the needle within the haystack.
I know you aren't searching for substrings, but I think it might be worthwhile to create a reversed genome string containing the matches; the task would then be to find common substrings in the two strings.
Example:
Original string
ATCAG GACCA TACGC CTGAT
Reversed string:
TAGTC CGCAT ACCAG GACTA
If you then transform the string to it's pairing values (replace T<->A and C<->G, you get something useful:
ATCAG GCGTA TGGTC CTGAT
I know that this preprocessing will be costly and consume a lot of space, but you will be able to use standard string algorithms afterwards and with the amount of comparisons you are searching, it certainly will be justfied.
When the original string and the reversed lookup string I think your problem sounds surprisingly alike to the 'longest common substring' problem which is well described. Your second preprocessing would be to construct a suffix tree to allow fast lookup of substrings.
you will end up with quadratic running times, but I doubt you can do better
Easiest solution would be to construct a Trie from the data of maximum height maxLen. Each node should also have a list of locations where that particular string was found (it will be in ascending order, if the trie is created in order).
Then, while searching, just reverse compliment the search string and go through the trie. When you find a match check if one of the matches is located in proper interval, if 'yes' then output the strings.
Let me know if you need the detailed algorithm.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With