Let's say I have an extremely large sequence of characters of A-D, 4 Billion to be exact. My goal is to find the indexes of several new sequences of letters that are set at length 30 within that large sequence of characters. The problem also increases in difficulty when the sequence you are looking for has a small error (a letter is wrong). How should I tackle this problem?
The trivial method is to iterate one letter at a time across the whole 4 Billion text file, but that'll take forever with memory running out.
I've been told to utilize a hashmap, but I'm not sure exactly what to use as my key value pair. The idea of using regex has also come up, but I'm not entirely sure if it'll work with my problem. Any help in terms of direction would be appreciated. Thanks!
Here's an illustration of what I'm asking:
This is a classic problem call the longest common subsequence (LCS). There are many algorithms to solve it. The genome project does this kind of search a lot. The wiki link provided has many examples. Your threshold for errors would be a special case.
Are you doing something with gene sequencing? I ask only because you mention only 4 variables :)
By encoding in characters you are wasting 14 bits on every 2 you use. You could encode four nucleotide letters in just a single byte, then you'd need only half a gigabyte. As for an algorithm, you can study the code in java.lang.String.indexOf
and the Wikipedia page on Boyer-Moore algorithm.
BTW you could make the search instantaneous if you employed the Lucene index for this. The idea would be to index every 30-letter subsequence as a separate document in Lucene. As for error tolerance, you'd need to use N-grams, or make a fuzzy search (in Lucene 4 there's a new algorithm to quickly locate strings with edit distance up to 2 or 3).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With