Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the index of a set of characters in huge sequence of characters

Let's say I have an extremely large sequence of characters of A-D, 4 Billion to be exact. My goal is to find the indexes of several new sequences of letters that are set at length 30 within that large sequence of characters. The problem also increases in difficulty when the sequence you are looking for has a small error (a letter is wrong). How should I tackle this problem?

The trivial method is to iterate one letter at a time across the whole 4 Billion text file, but that'll take forever with memory running out.

I've been told to utilize a hashmap, but I'm not sure exactly what to use as my key value pair. The idea of using regex has also come up, but I'm not entirely sure if it'll work with my problem. Any help in terms of direction would be appreciated. Thanks!

Here's an illustration of what I'm asking:

like image 239
bigbitecode Avatar asked Feb 21 '23 10:02

bigbitecode


2 Answers

This is a classic problem call the longest common subsequence (LCS). There are many algorithms to solve it. The genome project does this kind of search a lot. The wiki link provided has many examples. Your threshold for errors would be a special case.

Are you doing something with gene sequencing? I ask only because you mention only 4 variables :)

like image 106
linuxuser27 Avatar answered Apr 25 '23 11:04

linuxuser27


By encoding in characters you are wasting 14 bits on every 2 you use. You could encode four nucleotide letters in just a single byte, then you'd need only half a gigabyte. As for an algorithm, you can study the code in java.lang.String.indexOf and the Wikipedia page on Boyer-Moore algorithm.

BTW you could make the search instantaneous if you employed the Lucene index for this. The idea would be to index every 30-letter subsequence as a separate document in Lucene. As for error tolerance, you'd need to use N-grams, or make a fuzzy search (in Lucene 4 there's a new algorithm to quickly locate strings with edit distance up to 2 or 3).

like image 36
Marko Topolnik Avatar answered Apr 25 '23 11:04

Marko Topolnik