Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting all string combinations by given maximal Hamming distance (number of mismatches) in Java

Is there an algorithm go generate all possible string combinations of a string (DNA Sequence) by a given number of maximal allowed positions that can variate (maximal Mismatches, maximal Hamming distance)?

The alphabet is {A,C,T,G}.

Example for a string AGCC and maximal number of Mismatches 2:

Hamming distance is 0
  {AGCC}
Hamming distance is 1
  {CGCC, TGCC, GGCC, AACC, ACCC, ATCC, AGAC, AGTC, ..., AGCG}
Hamming distance is 2
  {?}

One possible approach would be to generate a set with all permutations of a given String, iterate over them and remove all strings with greater Hamming distance that it should be.

That approach is very ressource-eating, by a given String of 20 characters and maximal Hamming distance of 5.

Is there another, more efficient approcahes / implementations for that?

like image 761
Maximilian Wiens Avatar asked Oct 09 '13 10:10

Maximilian Wiens


1 Answers

Just use a normal permutation generation algorithm, except that you pass around the distance, decrementing it when you've got a different character.

static void permute(char[] arr, int pos, int distance, char[] candidates)
{
   if (pos == arr.length)
   {
      System.out.println(new String(arr));
      return;
   }
   // distance > 0 means we can change the current character,
   //   so go through the candidates
   if (distance > 0)
   {
      char temp = arr[pos];
      for (int i = 0; i < candidates.length; i++)
      {
         arr[pos] = candidates[i];
         int distanceOffset = 0;
         // different character, thus decrement distance
         if (temp != arr[pos])
            distanceOffset = -1;
         permute(arr, pos+1, distance + distanceOffset, candidates);
      }
      arr[pos] = temp;
   }
   // otherwise just stick to the same character
   else
      permute(arr, pos+1, distance, candidates);
}

Call with:

permute("AGCC".toCharArray(), 0, 1, "ACTG".toCharArray());

Performance note:

For a string length of 20, distance of 5 and a 5-character alphabet, there are already over 17 million candidates (assuming my code is correct).

The above code takes less than a second to go through them on my machine (without printing), but don't expect any generator to be able to generate much more than that in a reasonable amount of time, as there are simply too many possibilities.

like image 97
Bernhard Barker Avatar answered Nov 10 '22 20:11

Bernhard Barker