I am trying to write a code that will tell me if one string is a substring of another string. The catch is that it does not matter if there are characters in between and the only characters that matter are 'A'
, 'T'
, 'G'
and 'C'
. For instance:
"TxxAA" is a subsequence of "CTyyGCACA"
"pln" is a subsequence of "oiu"
"TAA" is NOT a subsequence of "TCCCA"
Currently I am doing
private boolean subSequence(DNASequence other) {
other.fix();
boolean valid = false;
String t = other.toString();
data = dataFix(data);
int index = 0;
for (int i = 0; i < data.length(); i++) {
for (int j = 0; j < t.length(); j++) {
if(data.charAt(i) == t.charAt(j)) {
if( j >= index) {
valid = true;
index = j;
t = t.replace(t.charAt(j), '_');
} else {
valid = false;
}
}
}
}
if (data == "" || t == "" ) {
valid = true;
}
return valid;
}
private String dataFix(String data) {
for (int i = 0; i < data.length(); i += 1) {
char ch = data.charAt(i);
if (("ATGC".indexOf(ch) < 0))
data = data.replace(data.charAt(i), ' ');
}
data = data.replaceAll(" ", "").trim();
return data;
}
the fix()
and dataFix()
methods erase all characters besides "ATGC"
. As the code iterates through, it is replacing the character in t
that matches with data.charAt(i)
with a _
so that it does not rematch the same letter (I was having that problem).
Currently, what is happening is that the replace function is replacing every char in the string not just the char at the specific index (which is what it is supposed to do) What is a better way to approach this problem? Where am I going wrong? Thank you.
To answer the first question 'What is a better way to approach this problem?', I would recommend using Regular Expressions (or regex). Regular Expressions are a way to express patterns in text.
For this example where you have a search term:
TxxAA
a regex to describe the patter you are looking for could be:
T.*A.*A
Without going into too much detail the term .* is an expression for any number (zero or more) of any characters. So this regex describes a pattern which is: T; then any characters; A; then any characters; and then A.
Your original question becomes "does a sequence have a sub-sequence with the pattern T.*A.*A?". Java has a regex library built in and you can use the Pattern and Matcher objects to answer this question.
Some sample code as a demonstration:
public class DnaMatcher {
static boolean isSearchChar(char c) {
return 'A' == c || 'T' == c || 'G' == c || 'C' == c;
}
static Pattern preparePattern(String searchSequence) {
StringBuilder pattern = new StringBuilder();
boolean first = false;
for (char c : searchSequence.toCharArray()) {
if (isSearchChar(c)) {
if (first) {
first = false;
} else {
pattern.append(".*");
}
pattern.append(c);
}
}
return Pattern.compile(pattern.toString());
}
static boolean contains(String sequence, String searchSequence) {
Pattern pattern = preparePattern(searchSequence);
Matcher matcher = pattern.matcher(sequence);
return matcher.find();
}
public static void main(String...none) throws Exception {
System.out.println(contains("CTyyGCACA", "TxxAA")); // true
System.out.println(contains("TCCCA", "TAA")); // false
}
}
You can see that the preparePattern matches prepares the regex expression as discussed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With