Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to extract uppercase characters in Java

I am currently working with DNA sequences in the form of strings where introns are in lowercase and exons in uppercase characters. The aim of the method is to retrieve the exons in the form of a String as fast as possible.

Example of a sequence :

ATGGATGACAGgtgagaggacactcgggtcccagccccaggctctgccctcaggaagggggtcagctctcaggggcatctccctctcacagcccagccctggggatgatgtgggagccccatttatacacggtgcctccttctctcctagAGCCTACATAG

My first version was using the String replaceAll() method but it was particularly slow:

public String getExons(String sequence) {
    return sequence.replaceAll("[atgc]", "");
}

So I tried a new version which improved performance but remains fairly slow:

public String getExons(String sequence) {
    StringBuilder exonBuilder = new StringBuilder();

    for (int i = 0; i < sequence.length(); i++) {
        char c = sequence.charAt(i);
        if (c == 'A' || c == 'T' || c == 'G' || c == 'C') exonBuilder.append(c);
    }
    return exonBuilder.toString();

Is there another way to do it which would improve performance ?

like image 277
Thibault Robin Avatar asked Sep 02 '16 11:09

Thibault Robin


People also ask

How do you print all uppercase letters in Java?

Java String toUpperCase() Method The toUpperCase() method converts a string to upper case letters. Note: The toLowerCase() method converts a string to lower case letters.

How do you find capital letters in Java?

To check whether a character is in Uppercase or not in Java, use the Character. isUpperCase() method.

How can I remove uppercase from a string in Java?

Create regular expressions to remove uppercase, lowercase, special, numeric, and non-numeric characters from the string as mentioned below: regexToRemoveUpperCaseCharacters = “[A-Z]” regexToRemoveLowerCaseCharacters = “[a-z]” regexToRemoveSpecialCharacters = “[^A-Za-z0-9]”


2 Answers

You need to use a char array with double pointer trick. I get this results in my machine:

Edit: updated with warmup phase. Java is OpenJDK 8 from Ubuntu 14 LTS

Edit2: hash table is the fastest by far margin.

Edit3: I had a bug in my code. The double pointer trick is the fastest.

GTCtgACgGT
getExons1: 1068
getExons2: 377
getExons3: 313
getExons3b: 251
getExons4: 586
getExons5: 189
getExons6: 671

Edit4: Running the benchmarks with JMH with a 1M DNA String. The result is consistent with my previous benchmark regarding "x better than y" and the worst being the regexp, the best being the double pointer, and the second best being the naive 3B:

Benchmark                  Mode  Cnt    Score   Error  Units
MyBenchmark.benchExons1   thrpt  200   33.659 ± 1.036  ops/s
MyBenchmark.benchExons2   thrpt  200  107.095 ± 4.074  ops/s
MyBenchmark.benchExons3a  thrpt  200  118.543 ± 3.779  ops/s
MyBenchmark.benchExons3b  thrpt  200  163.717 ± 4.602  ops/s
MyBenchmark.benchExons4   thrpt  200   69.942 ± 2.019  ops/s
MyBenchmark.benchExons5   thrpt  200  191.142 ± 5.307  ops/s
MyBenchmark.benchExons6   thrpt  200   57.654 ± 1.963  ops/s

Edit5: With 10 MB Strings:

Benchmark                  Mode  Cnt   Score   Error  Units
MyBenchmark.benchExons1   thrpt  200   4.640 ± 0.068  ops/s
MyBenchmark.benchExons2   thrpt  200  13.451 ± 0.161  ops/s
MyBenchmark.benchExons3a  thrpt  200  15.379 ± 0.232  ops/s
MyBenchmark.benchExons3b  thrpt  200  19.550 ± 0.181  ops/s
MyBenchmark.benchExons4   thrpt  200   8.510 ± 0.147  ops/s
MyBenchmark.benchExons5   thrpt  200  24.343 ± 0.331  ops/s
MyBenchmark.benchExons6   thrpt  200   7.339 ± 0.074  ops/s

The code:

package org.sample;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import java.util.HashMap;
import java.util.Random;

@State(Scope.Thread)
public class MyBenchmark {

    String DNA;

    public MyBenchmark() {
        DNA = buildRandomDNA(1000000);
    }

    static String letters = "ATGCatgc";

    public static String buildRandomDNA(int size) {
        StringBuilder builder = new StringBuilder(size);
        Random r = new Random();

        for (int i = 0; i < size; ++i) {
            builder.append(letters.charAt(r.nextInt(letters.length())));
        }

        return builder.toString();
    }

    @Benchmark
    public void benchExons1() {
        getExons1(DNA);
    }

    @Benchmark
    public void benchExons2() {
        getExons2(DNA);
    }

    @Benchmark
    public void benchExons3a() {
        getExons3a(DNA);
    }

    @Benchmark
    public void benchExons3b() {
        getExons3b(DNA);
    }

    @Benchmark
    public void benchExons4() {
        getExons4(DNA);
    }

    @Benchmark
    public void benchExons5() {
        getExons5(DNA);
    }

    @Benchmark
    public void benchExons6() {
        getExons6(DNA);
    }

    public static String getExons1(String sequence) {
        return sequence.replaceAll("[atgc]", "");
    }

    public static String getExons2(String sequence) {
        StringBuilder exonBuilder = new StringBuilder();

        for (int i = 0; i < sequence.length(); i++) {
            char c = sequence.charAt(i);
            if (c == 'A' || c == 'T' || c == 'G' || c == 'C')
                exonBuilder.append(c);
        }
        return exonBuilder.toString();
    }

    public static String getExons3a(String sequence) {
        StringBuilder exonBuilder = new StringBuilder();

        for (int i = 0; i < sequence.length(); i++) {
            char c = sequence.charAt(i);
            if (c <= 'Z') {
                exonBuilder.append((char) c);
            }
        }

        return exonBuilder.toString();
    }

    public static String getExons3b(String sequence1) {
        char[] sequence = sequence1.toCharArray();
        StringBuilder exonBuilder = new StringBuilder();

        for (int i = 0; i < sequence.length; i++) {
            if (sequence[i] <= 'Z') {
                exonBuilder.append(sequence[i]);
            }
        }

        return exonBuilder.toString();
    }

    public static HashMap<String, String> M = new HashMap<String, String>();

    public static void buildTable() {
        for (int a = 0; a < letters.length(); ++a) {
            for (int b = 0; b < letters.length(); ++b) {
                for (int c = 0; c < letters.length(); ++c) {
                    for (int d = 0; d < letters.length(); ++d) {
                        String key = "" + letters.charAt(a) + letters.charAt(b) + letters.charAt(c) + letters.charAt(d);
                        M.put(key, getExons1(key));
                    }
                }
            }
        }
    }

    public static String getExons4(String sequence1) {
        char[] sequence = sequence1.toCharArray();
        StringBuilder exonBuilder = new StringBuilder();

        for (int i = 0; i < sequence.length; i += 4) {
            exonBuilder.append(M.get(new String(sequence, i, 4)));
        }

        return exonBuilder.toString();
    }

    public static String getExons5(String sequence1) {
        char[] sequence = sequence1.toCharArray();
        int p = 0;

        for (int i = 0; i < sequence.length; i++) {
            if (sequence[i] <= 'Z') {
                sequence[p] = sequence[i];
                ++p;
            }
        }

        return new String(sequence, 0, p);
    }

    public static int dnatoint(char[] s, int start, int len) {
        int key = 0;
        for (; len > 0; len--, start++) {
            switch (s[start]) {
            case 'A': key = (key << 3) | 0; break;
            case 'C': key = (key << 3) | 1; break;
            case 'G': key = (key << 3) | 2; break;
            case 'T': key = (key << 3) | 3; break;
            case 'a': key = (key << 3) | 4; break;
            case 'c': key = (key << 3) | 5; break;
            case 'g': key = (key << 3) | 6; break;
            case 't': key = (key << 3) | 7; break;
            }
        }
        return key;
    }

    public static String[] M2 = new String[8*8*8*8];

    public static void buildTable2() {
        for (int a = 0; a < letters.length(); ++a) {
            for (int b = 0; b < letters.length(); ++b) {
                for (int c = 0; c < letters.length(); ++c) {
                    for (int d = 0; d < letters.length(); ++d) {
                        String key = "" + letters.charAt(a) + letters.charAt(b) + letters.charAt(c) + letters.charAt(d);
                        M2[dnatoint(key.toCharArray(), 0, 4)] = getExons1(key);
                    }
                }
            }
        }
    }

    public static String getExons6(String sequence1) {
        char[] sequence = sequence1.toCharArray();
        StringBuilder exonBuilder = new StringBuilder();

        assert (sequence.length % 4) == 0;

        for (int i = 0; i < sequence.length; i += 4) {
            exonBuilder.append(M2[dnatoint(sequence, i, 4)]);
        }

        return exonBuilder.toString();
    }

    static {
        buildTable();
        buildTable2();
    }

    //@Benchmark
    public void testMethod() {
        // This is a demo/sample template for building your JMH benchmarks. Edit as needed.
        // Put your benchmark code here.
    }

}
like image 94
vz0 Avatar answered Oct 14 '22 00:10

vz0


Use below:

 public String getExons(String sequence) {
    StringBuilder exonBuilder = new StringBuilder();

    for (int i = 0; i < sequence.length(); i++) {
        char c = sequence.charAt(i);
        if((int)c>=65 && (int)c <=90){
           exonBuilder.append((char)c);
         }
     }

    return exonBuilder.toString();
like image 36
Mayur Maheshwari Avatar answered Oct 14 '22 00:10

Mayur Maheshwari