Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to determine whether a string is a subsequence of another string regardless of characters in between?

Tags:

java

string

I am trying to write a code that will tell me if one string is a substring of another string. The catch is that it does not matter if there are characters in between and the only characters that matter are 'A', 'T', 'G' and 'C'. For instance:

"TxxAA" is     a subsequence of "CTyyGCACA"
"pln"   is     a subsequence of "oiu"
"TAA"   is NOT a subsequence of "TCCCA" 

Currently I am doing

private boolean subSequence(DNASequence other) {

    other.fix();
    boolean valid = false;
    String t = other.toString();
    data = dataFix(data);
    int index = 0;

    for (int i = 0; i < data.length(); i++) {
        for (int j = 0; j < t.length(); j++) {
            if(data.charAt(i) == t.charAt(j)) {                        
                if( j >= index) {
                    valid = true;
                    index = j;
                    t = t.replace(t.charAt(j), '_');
                } else {
                    valid = false;
                }
            }
        }

    }

    if (data == "" || t == "" ) {
        valid = true;
    }
    return valid;
}

private String dataFix(String data) {
    for (int i = 0; i < data.length(); i += 1) {
        char ch = data.charAt(i);
        if (("ATGC".indexOf(ch) < 0))
            data = data.replace(data.charAt(i), ' ');        
    }
    data = data.replaceAll(" ", "").trim();
    return data;
}

the fix() and dataFix() methods erase all characters besides "ATGC". As the code iterates through, it is replacing the character in t that matches with data.charAt(i) with a _ so that it does not rematch the same letter (I was having that problem).

Currently, what is happening is that the replace function is replacing every char in the string not just the char at the specific index (which is what it is supposed to do) What is a better way to approach this problem? Where am I going wrong? Thank you.

like image 379
jake Avatar asked Jan 04 '23 08:01

jake


1 Answers

To answer the first question 'What is a better way to approach this problem?', I would recommend using Regular Expressions (or regex). Regular Expressions are a way to express patterns in text.

For this example where you have a search term:

TxxAA

a regex to describe the patter you are looking for could be:

T.*A.*A

Without going into too much detail the term .* is an expression for any number (zero or more) of any characters. So this regex describes a pattern which is: T; then any characters; A; then any characters; and then A.

Your original question becomes "does a sequence have a sub-sequence with the pattern T.*A.*A?". Java has a regex library built in and you can use the Pattern and Matcher objects to answer this question.

Some sample code as a demonstration:

public class DnaMatcher {

    static boolean isSearchChar(char c) {
        return 'A' == c || 'T' == c || 'G' == c || 'C' == c;
    }

    static Pattern preparePattern(String searchSequence) {
        StringBuilder pattern = new StringBuilder();
        boolean first = false;
        for (char c : searchSequence.toCharArray()) {
            if (isSearchChar(c)) {
                if (first) {
                    first = false;
                } else {
                    pattern.append(".*");
                }
                pattern.append(c);
            }
        }
        return Pattern.compile(pattern.toString());
    }

    static boolean contains(String sequence, String searchSequence) {
        Pattern pattern = preparePattern(searchSequence);
        Matcher matcher = pattern.matcher(sequence);
        return matcher.find();
    }

    public static void main(String...none) throws Exception {
        System.out.println(contains("CTyyGCACA", "TxxAA")); // true
        System.out.println(contains("TCCCA", "TAA")); // false
    }
}

You can see that the preparePattern matches prepares the regex expression as discussed.

like image 86
Evan Jones Avatar answered Jan 06 '23 03:01

Evan Jones