I am currently working with DNA sequences in the form of strings where introns are in lowercase and exons in uppercase characters. The aim of the method is to retrieve the exons in the form of a String as fast as possible.
Example of a sequence :
ATGGATGACAGgtgagaggacactcgggtcccagccccaggctctgccctcaggaagggggtcagctctcaggggcatctccctctcacagcccagccctggggatgatgtgggagccccatttatacacggtgcctccttctctcctagAGCCTACATAG
My first version was using the String replaceAll() method but it was particularly slow:
public String getExons(String sequence) {
return sequence.replaceAll("[atgc]", "");
}
So I tried a new version which improved performance but remains fairly slow:
public String getExons(String sequence) {
StringBuilder exonBuilder = new StringBuilder();
for (int i = 0; i < sequence.length(); i++) {
char c = sequence.charAt(i);
if (c == 'A' || c == 'T' || c == 'G' || c == 'C') exonBuilder.append(c);
}
return exonBuilder.toString();
Is there another way to do it which would improve performance ?
Java String toUpperCase() Method The toUpperCase() method converts a string to upper case letters. Note: The toLowerCase() method converts a string to lower case letters.
To check whether a character is in Uppercase or not in Java, use the Character. isUpperCase() method.
Create regular expressions to remove uppercase, lowercase, special, numeric, and non-numeric characters from the string as mentioned below: regexToRemoveUpperCaseCharacters = “[A-Z]” regexToRemoveLowerCaseCharacters = “[a-z]” regexToRemoveSpecialCharacters = “[^A-Za-z0-9]”
You need to use a char array with double pointer trick. I get this results in my machine:
Edit: updated with warmup phase. Java is OpenJDK 8 from Ubuntu 14 LTS
Edit2: hash table is the fastest by far margin.
Edit3: I had a bug in my code. The double pointer trick is the fastest.
GTCtgACgGT
getExons1: 1068
getExons2: 377
getExons3: 313
getExons3b: 251
getExons4: 586
getExons5: 189
getExons6: 671
Edit4: Running the benchmarks with JMH with a 1M DNA String. The result is consistent with my previous benchmark regarding "x better than y" and the worst being the regexp, the best being the double pointer, and the second best being the naive 3B:
Benchmark Mode Cnt Score Error Units
MyBenchmark.benchExons1 thrpt 200 33.659 ± 1.036 ops/s
MyBenchmark.benchExons2 thrpt 200 107.095 ± 4.074 ops/s
MyBenchmark.benchExons3a thrpt 200 118.543 ± 3.779 ops/s
MyBenchmark.benchExons3b thrpt 200 163.717 ± 4.602 ops/s
MyBenchmark.benchExons4 thrpt 200 69.942 ± 2.019 ops/s
MyBenchmark.benchExons5 thrpt 200 191.142 ± 5.307 ops/s
MyBenchmark.benchExons6 thrpt 200 57.654 ± 1.963 ops/s
Edit5: With 10 MB Strings:
Benchmark Mode Cnt Score Error Units
MyBenchmark.benchExons1 thrpt 200 4.640 ± 0.068 ops/s
MyBenchmark.benchExons2 thrpt 200 13.451 ± 0.161 ops/s
MyBenchmark.benchExons3a thrpt 200 15.379 ± 0.232 ops/s
MyBenchmark.benchExons3b thrpt 200 19.550 ± 0.181 ops/s
MyBenchmark.benchExons4 thrpt 200 8.510 ± 0.147 ops/s
MyBenchmark.benchExons5 thrpt 200 24.343 ± 0.331 ops/s
MyBenchmark.benchExons6 thrpt 200 7.339 ± 0.074 ops/s
The code:
package org.sample;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import java.util.HashMap;
import java.util.Random;
@State(Scope.Thread)
public class MyBenchmark {
String DNA;
public MyBenchmark() {
DNA = buildRandomDNA(1000000);
}
static String letters = "ATGCatgc";
public static String buildRandomDNA(int size) {
StringBuilder builder = new StringBuilder(size);
Random r = new Random();
for (int i = 0; i < size; ++i) {
builder.append(letters.charAt(r.nextInt(letters.length())));
}
return builder.toString();
}
@Benchmark
public void benchExons1() {
getExons1(DNA);
}
@Benchmark
public void benchExons2() {
getExons2(DNA);
}
@Benchmark
public void benchExons3a() {
getExons3a(DNA);
}
@Benchmark
public void benchExons3b() {
getExons3b(DNA);
}
@Benchmark
public void benchExons4() {
getExons4(DNA);
}
@Benchmark
public void benchExons5() {
getExons5(DNA);
}
@Benchmark
public void benchExons6() {
getExons6(DNA);
}
public static String getExons1(String sequence) {
return sequence.replaceAll("[atgc]", "");
}
public static String getExons2(String sequence) {
StringBuilder exonBuilder = new StringBuilder();
for (int i = 0; i < sequence.length(); i++) {
char c = sequence.charAt(i);
if (c == 'A' || c == 'T' || c == 'G' || c == 'C')
exonBuilder.append(c);
}
return exonBuilder.toString();
}
public static String getExons3a(String sequence) {
StringBuilder exonBuilder = new StringBuilder();
for (int i = 0; i < sequence.length(); i++) {
char c = sequence.charAt(i);
if (c <= 'Z') {
exonBuilder.append((char) c);
}
}
return exonBuilder.toString();
}
public static String getExons3b(String sequence1) {
char[] sequence = sequence1.toCharArray();
StringBuilder exonBuilder = new StringBuilder();
for (int i = 0; i < sequence.length; i++) {
if (sequence[i] <= 'Z') {
exonBuilder.append(sequence[i]);
}
}
return exonBuilder.toString();
}
public static HashMap<String, String> M = new HashMap<String, String>();
public static void buildTable() {
for (int a = 0; a < letters.length(); ++a) {
for (int b = 0; b < letters.length(); ++b) {
for (int c = 0; c < letters.length(); ++c) {
for (int d = 0; d < letters.length(); ++d) {
String key = "" + letters.charAt(a) + letters.charAt(b) + letters.charAt(c) + letters.charAt(d);
M.put(key, getExons1(key));
}
}
}
}
}
public static String getExons4(String sequence1) {
char[] sequence = sequence1.toCharArray();
StringBuilder exonBuilder = new StringBuilder();
for (int i = 0; i < sequence.length; i += 4) {
exonBuilder.append(M.get(new String(sequence, i, 4)));
}
return exonBuilder.toString();
}
public static String getExons5(String sequence1) {
char[] sequence = sequence1.toCharArray();
int p = 0;
for (int i = 0; i < sequence.length; i++) {
if (sequence[i] <= 'Z') {
sequence[p] = sequence[i];
++p;
}
}
return new String(sequence, 0, p);
}
public static int dnatoint(char[] s, int start, int len) {
int key = 0;
for (; len > 0; len--, start++) {
switch (s[start]) {
case 'A': key = (key << 3) | 0; break;
case 'C': key = (key << 3) | 1; break;
case 'G': key = (key << 3) | 2; break;
case 'T': key = (key << 3) | 3; break;
case 'a': key = (key << 3) | 4; break;
case 'c': key = (key << 3) | 5; break;
case 'g': key = (key << 3) | 6; break;
case 't': key = (key << 3) | 7; break;
}
}
return key;
}
public static String[] M2 = new String[8*8*8*8];
public static void buildTable2() {
for (int a = 0; a < letters.length(); ++a) {
for (int b = 0; b < letters.length(); ++b) {
for (int c = 0; c < letters.length(); ++c) {
for (int d = 0; d < letters.length(); ++d) {
String key = "" + letters.charAt(a) + letters.charAt(b) + letters.charAt(c) + letters.charAt(d);
M2[dnatoint(key.toCharArray(), 0, 4)] = getExons1(key);
}
}
}
}
}
public static String getExons6(String sequence1) {
char[] sequence = sequence1.toCharArray();
StringBuilder exonBuilder = new StringBuilder();
assert (sequence.length % 4) == 0;
for (int i = 0; i < sequence.length; i += 4) {
exonBuilder.append(M2[dnatoint(sequence, i, 4)]);
}
return exonBuilder.toString();
}
static {
buildTable();
buildTable2();
}
//@Benchmark
public void testMethod() {
// This is a demo/sample template for building your JMH benchmarks. Edit as needed.
// Put your benchmark code here.
}
}
Use below:
public String getExons(String sequence) {
StringBuilder exonBuilder = new StringBuilder();
for (int i = 0; i < sequence.length(); i++) {
char c = sequence.charAt(i);
if((int)c>=65 && (int)c <=90){
exonBuilder.append((char)c);
}
}
return exonBuilder.toString();
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With