Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm help! Fast algorithm in searching for a string with its partner

I am looking for a fast algorithm for search purpose in a huge string (it's a organism genome sequence composed of hundreds of millions to billions of chars).

There are only 4 chars {A,C,G,T} present in this string, and "A" can only pair with "T" while "C" pairs with "G".

Now I am searching for two substrings (with length constraint of both substring between {minLen, maxLen}, and interval length between {intervalMinLen, intervalMaxLen}) that can pair with one another antiparallely.

For example, The string is: ATCAG GACCA TACGC CTGAT

Constraints: minLen = 4, maxLen = 5, intervalMinLen = 9, intervalMaxLen = 10

The result should be

  1. "ATCAG" pair with "CTGAT"

  2. "TCAG" pair with "CTGA"

Thanks in advance.

Update: I already have the method to determine whether two string can pair with one another. The only concern is doing exhaustive search is very time consuming.

like image 210
Mavershang Avatar asked Jan 10 '12 22:01

Mavershang


People also ask

What is the fastest string search algorithm?

The Aho-Corasick string searching algorithm simultaneously finds all occurrences of multiple patterns in one pass through the text. On the other hand, the Boyer-Moore algorithm is understood to be the fastest algorithm for a single pattern.

Which algorithm is best for string matching?

The so-called naive or brute force algorithm is the most intuitive approach to the string pattern-matching problem.

What is fast pattern-matching algorithm?

Pattern-matching algorithms scan the text with the help of a window, whose size is equal to the length of the pattern. The first step is to align the left ends of the window and the text and then compare the corresponding characters of the window and the pattern; this procedure is known as attempt.

How the searching process will be taking place with related to strings?

Overview. The most basic case of string searching involves one (often very long) string, sometimes called the haystack, and one (often very short) string, sometimes called the needle. The goal is to find one or more occurrences of the needle within the haystack.


2 Answers

I know you aren't searching for substrings, but I think it might be worthwhile to create a reversed genome string containing the matches; the task would then be to find common substrings in the two strings.

Example:

Original string

  ATCAG GACCA TACGC CTGAT

Reversed string:

  TAGTC CGCAT ACCAG GACTA

If you then transform the string to it's pairing values (replace T<->A and C<->G, you get something useful:

  ATCAG GCGTA TGGTC CTGAT

I know that this preprocessing will be costly and consume a lot of space, but you will be able to use standard string algorithms afterwards and with the amount of comparisons you are searching, it certainly will be justfied.

When the original string and the reversed lookup string I think your problem sounds surprisingly alike to the 'longest common substring' problem which is well described. Your second preprocessing would be to construct a suffix tree to allow fast lookup of substrings.

you will end up with quadratic running times, but I doubt you can do better

like image 72
faester Avatar answered Oct 12 '22 13:10

faester


Easiest solution would be to construct a Trie from the data of maximum height maxLen. Each node should also have a list of locations where that particular string was found (it will be in ascending order, if the trie is created in order).

Then, while searching, just reverse compliment the search string and go through the trie. When you find a match check if one of the matches is located in proper interval, if 'yes' then output the strings.

Let me know if you need the detailed algorithm.

like image 29
ElKamina Avatar answered Oct 12 '22 13:10

ElKamina