I got an challenge to make an algorithm in java that calculates how much possible DNA chains can an string form. String can contain these 5 characters (A, G, C, T, ?)
? in the string can be (A, G, C or T) but ? may not cause an pair in the string. For example, in this string "A?G" ? could only be C or T. There can be infinite pair of question marks, since they are all characters in the end.
The function form is this
public static int chains(String base) {
// return the amount of chains
}
if the base string would be "A?C?" possible combinations would be 6 = (AGCA, AGCG ,AGCT ,ATCA ,ATCG ,ATCT)
Cases (??? - 36) (AGAG - 1) (A???T - 20)
(? - 4) (A? - 3) (?A - 3) (?? - 12) (A?A - 3) (A?C - 2) ...
Max length of the given base(pohja) string is 10!
Criteria: 1. Combinations that have two characters in a row are are illegal combinations so those don't count.
What I have so far:
public static int chains(String pohja) {
int sum = 1;
int length = pohja.length();
char[] arr = pohja.toCharArray();
int questionMarks = 0;
if (length == 1) {
if (pohja.equals("?"))
return 4;
else
return 1;
} else if (length == 2) {
boolean allQuestionMarks = true;
for (int i = 0; i < 2; i++) {
if (arr[i] != '?')
allQuestionMarks = false;
else
questionMarks++;
}
if (allQuestionMarks) return 12;
if (questionMarks == 1) {
return 3;
} else {
return 2;
}
} else {
questionMarks = 0;
for (int i = 0; i < length; i++) {
if (arr[i] == '?') questionMarks++;
}
for (int i = 1; i < length - 1; i++) {
boolean leftIsLetter = isLetter(arr[i - 1]);
boolean rightIsLetter = isLetter(arr[i + 1]);
boolean sameSides = false;
if (arr[i - 1] == arr[i + 1]) sameSides = true;
if (arr[i] != '?') { // middle is char
if (leftIsLetter && rightIsLetter) { // letter(left) == letter(right)
if (sameSides) {
// Do nothing!
} else {
sum *= 3;
}
} else if (!leftIsLetter && !rightIsLetter) { // !letter(left) == !letter(right)
} else { // letter(left) != letter(right)
}
} else { // Middle is ?
if (leftIsLetter && rightIsLetter) { // letter(left) == letter(right)
if (sameSides) {
sum *= 3;
} else {
sum *= 2;
}
} else if (!leftIsLetter && !rightIsLetter) { // !letter(left) == !letter(right)
sum *= 9;
} else { // letter(left) != letter(right)
if (arr[i - 1] == '?') { // ? is on the left
} else { // ? is on the right
sum *= 2;
}
}
}
}
}
return sum;
}
public static boolean isLetter(char c) {
boolean isLetter = false;
char[] dna = { 'A', 'G', 'C', 'T' };
for (int i = 0; i < 4; i++) {
if (c == dna[i]) isLetter = true;
}
return isLetter;
}
Yeah, I know, my code is a mess. If the length of pohja(base) is 3 or more, my algorithm will check 3 characters at a time and modify sum depending on the characters that the algorithm is checking.
Could anyone give an hint on how I can solve this? :) Thanks in advance, TuukkaX.
Note: I will keep this somewhat vague since you only asked for a hint. If you would like me to iterate further, feel free to ask.
What you need to know in order to solve this mathematically is the amount of substitutions you can make at each sequence of question marks (i.e. in the base string "A?GCT???T?G", you'd have three sequences of question marks - two containing one question mark each, and one with three). In a situation like this, the total amount of substitutions you can have is equal to the product of the amount of substitutions you can make for each of the sequences.
Simple example: In the string "A?G?", the first question mark can be replaced by two characters, while the second one can be replaced by three. So overall, that's 2*3 = 6 legal possibilities.
The challenge in calculating the result like this lies in finding out how to calculate the amount of substitutions you can make for longer sequences of question marks. I'll give you one last tip and include the solution as a spoiler: The amount of legal substitution depends on the characters before and after the question marks. I'll leave finding out in which way up to you, though.
Clarification on that:
The amount of substitutions you can make depends on whether the characters before and after the question marks are equal or not. For instance,
"A??A"has a total of 6 legal possibilities and"A??G"has 7. This needs to be taken into consideration.
And here's how to work that into a solution:
Now, how to solve something like
"A????A"? Remember, total amount of substitutions = product of substitutions for each individual sequence."A????A"is a sequence of four question marks, and the characters before and after them are equal. There are three legal possibilities to replace the second character, and each of them leaves"[G|C|T]???A"- as in, a sequence of three question marks with the previous and following character not being equal. You can keep doing this recursively to get a total amount of possible result strings. Keep in mind that question marks at the very start and end of the base string require special treatment.
In case you still can't work it out, I'll give you a possible header to a method to calculate the amount of legal substitutions for a sequence:
private int calcSequenceSubs(int length, boolean prevFollEqual)
And this could be the body:
if (prevFollEqual){
if (length == 1) return 3;
else return 3 * calcSequenceSubs(length-1, false);
} else {
if (length == 1) return 2;
else return 2 * calcSequenceSubs(length-1, false) + calcSequenceSubs(length-1, true);
}
Edit (simplified version without spoilers):
The amount of legal solutions for the entire string is equal to the product of the amount of solutions for each sequence of question marks. For instance, "A?A?A" has two sequences of question marks and each of them has three legal substitutions, so the entire string has a total of 3*3 = 9 legal solutions.
So, what needs to be done is:
The tricky part is actually caltulating the amount of legal substitutions for each of the sequences. These depend on two things: The length of the sequence (obviously) and whether the characters before and after the sequence are equal (a single question mark, for instance, has 3 possible outcomes when the previous and following character are equal and two otherwise).
Now, for longer sequences, the total amount of legal substitutions can be calculated resursively. For instance, "A??T" is a sequence of two question marks and the previous and following characters are not equal. The first question mark can be replaced by either T,G or C, resulting in either "T?T", "G?T" or "C?T". Two of those are sequences of one question mark where the previous and following character are not equal and one of them is a sequence of one question mark where the previous and following character are equal.
The pattern for the recursive algorithm is always the same - if the previous and following character of the sequence are not equal, two of the options result in a sequence where previous and following character are different and one where they're the same. Likewise, when the previous and following character in the original sequence were equal, all 3 of the options result in the next step being a sequence where previous and following character are different.
A code example of a possible solution:
public static int DNAChains(String base) {
if (base == null || base.length() == 0) {
return 0;
}
int curSequence = 0;
int totalSolutions = 1;
boolean inSequence = false;
//flag to check whether there are any sequences present.
//if not, there is one solution rather than 0
char prevChar = 'x';
char follChar = 'y';
int i = 0;
char[] chars = base.toCharArray();
//handle starting sequence if present
while (i < chars.length && chars[i] == '?') {
curSequence++;
i++;
}
if (curSequence > 0) {
//exclusively ?'s needs to be treated even differently
if (i < chars.length) {
//? at the edge can be anything, so 3*false, 1*true
//if length is 1 though, there are just 3 solutions
totalSolutions *= (curSequence > 1) ? 3 * solveSequence(curSequence - 1, false) + solveSequence(curSequence - 1, true) : 3;
curSequence = 0;
} else {
//result is 4*3^(length-1)
totalSolutions = 4* ((int) Math.pow(3, chars.length-1));
}
}
//check for sequences of question marks
for (; i < chars.length; i++) {
if (chars[i] == '?') {
if (!inSequence) {
inSequence = true;
prevChar = chars[i - 1];
//there is at least one sequence -> set flag
}
curSequence++;
} else if (inSequence) {
inSequence = false;
follChar = chars[i];
totalSolutions *= solveSequence(curSequence, prevChar == follChar);
curSequence = 0;
}
}
//check if last sequence ends at the edge of the string
//if it does, handle edge case like in the beginning
if (inSequence) {
//? at the edge can be anything, so 3*false, 1*true
//if length is 1 though, there are just 3 solutions
totalSolutions *= (curSequence > 1) ? 3 * solveSequence(curSequence - 1, false) + solveSequence(curSequence - 1, true) : 3;
}
return totalSolutions;
}//end DNAChains
private static int solveSequence(int length, boolean prevFollEqual) {
if (prevFollEqual) {
//anchor
if (length == 1) {
return 3;
} else {
return 3 * solveSequence(length - 1, false);
}
} else {
//anchor
if (length == 1) {
return 2;
} else {
return 2 * solveSequence(length - 1, false) + solveSequence(length - 1, true);
}
}
}//end solveSequence
I didn't test this thoroughly, but it seems to work. I managed to deal with the edge cases as well (not 100% sure whether I got all of them, though).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With