 

How can we compress DNA string efficiently

DNA strings can be of any length and consist of any combination of the five letters A, T, G, C, N.
What would be an efficient way to compress a DNA string over this five-letter alphabet? Instead of using 3 bits per letter, can we compress and retrieve efficiently using fewer bits? Can anybody suggest pseudo code for efficient compression and retrieval?

ksn asked Aug 15 '18

3 Answers

You can, if you are willing to (a) use a different bit length for each character and (b) always read from the start, never from the middle. Then you can use a code such as:

  • A - 00
  • T - 01
  • G - 10
  • C - 110
  • N - 111

Reading from left to right, a stream of bits splits into characters in only one way, because the code is prefix-free. You read 2 bits at a time, and if they are "11" you read one more bit to know which character it is.

This is based on Huffman coding.

Note:
I don't know much about DNA, but if the character probabilities are not equal (i.e., not 20% each), you should assign the shortest codes to the most frequent characters.
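The scheme above can be sketched in C. This is a minimal illustration, not a tuned implementation; the function names (`encode`, `decode`, `code_for`) are made up for this example, and the output buffer is assumed to be zero-initialized before encoding:

```c
#include <stdint.h>
#include <string.h>

/* Variable-length prefix codes: A=00, T=01, G=10, C=110, N=111 */
static void code_for(char c, uint32_t *bits, int *len) {
    switch (c) {
        case 'A': *bits = 0x0; *len = 2; break;
        case 'T': *bits = 0x1; *len = 2; break;
        case 'G': *bits = 0x2; *len = 2; break;
        case 'C': *bits = 0x6; *len = 3; break;
        default:  *bits = 0x7; *len = 3; break; /* 'N' */
    }
}

/* Append each symbol's code, MSB-first, into a zeroed byte buffer.
 * Returns the total number of bits written. */
static int encode(const char *dna, uint8_t *out) {
    int nbits = 0;
    for (; *dna; ++dna) {
        uint32_t bits; int len;
        code_for(*dna, &bits, &len);
        for (int i = len - 1; i >= 0; --i) {
            if ((bits >> i) & 1)
                out[nbits / 8] |= (uint8_t)(0x80 >> (nbits % 8));
            ++nbits;
        }
    }
    return nbits;
}

/* Read 2 bits; if they are "11", read one more bit to tell C from N. */
static void decode(const uint8_t *in, int nbits, char *out) {
    static const char two[] = { 'A', 'T', 'G' };
    int pos = 0;
    while (pos + 1 < nbits) {
        int b0 = (in[pos / 8] >> (7 - pos % 8)) & 1; ++pos;
        int b1 = (in[pos / 8] >> (7 - pos % 8)) & 1; ++pos;
        int v = b0 * 2 + b1;
        if (v < 3) {
            *out++ = two[v];
        } else {
            int b2 = (in[pos / 8] >> (7 - pos % 8)) & 1; ++pos;
            *out++ = b2 ? 'N' : 'C';
        }
    }
    *out = '\0';
}
```

With uniform input this averages 2.4 bits per character (three 2-bit codes, two 3-bit codes), versus 3 bits for a fixed-width code.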

Roee Gavirel answered Oct 07 '22


You have 5 unique values, so you need a base-5 encoding (e.g. A=0, T=1, G=2, C=3, N=4).

In 32 bits you can fit ⌊log5(2^32)⌋ = 13 base-5 values.

In 64 bits you can fit ⌊log5(2^64)⌋ = 27 base-5 values.

The encoding process would be:

uint8_t *input = /* base-5 encoded DNA letters */;
uint64_t packed = 0;
for (int i = 0; i < 27; ++i) {
    packed = packed * 5 + *input++;
}

And decoding:

uint8_t *output = /* allocate buffer */;
uint64_t packed = /* next encoded chunk */;
for (int i = 0; i < 27; ++i) {
    *output++ = packed % 5;  /* note: values come out in reverse order of packing */
    packed /= 5;
}
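Putting the two fragments together, here is a minimal self-contained round trip. The function names (`base5`, `pack27`, `unpack27`) are illustrative; the unpacker fills the output backwards, since repeated division yields the last-packed digit first:

```c
#include <stdint.h>
#include <string.h>

/* Map A,T,G,C,N -> 0..4 */
static int base5(char c) {
    switch (c) {
        case 'A': return 0;
        case 'T': return 1;
        case 'G': return 2;
        case 'C': return 3;
        default:  return 4; /* 'N' */
    }
}

/* Pack up to 27 symbols into one uint64_t (5^27 < 2^64). */
static uint64_t pack27(const char *dna, int n) {
    uint64_t packed = 0;
    for (int i = 0; i < n; ++i)
        packed = packed * 5 + (uint64_t)base5(dna[i]);
    return packed;
}

/* Unpack n symbols; repeated division yields them last-first, so fill backwards. */
static void unpack27(uint64_t packed, int n, char *out) {
    static const char sym[] = "ATGCN";
    out[n] = '\0';
    for (int i = n - 1; i >= 0; --i) {
        out[i] = sym[packed % 5];
        packed /= 5;
    }
}
```

This gives 64/27 ≈ 2.37 bits per letter, close to the theoretical log2(5) ≈ 2.32 bits.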
rustyx answered Oct 07 '22


There are plenty of compression methods, but the main question is what data you want to compress:

  1. Raw unaligned sequenced data from the sequencing machine (FASTQ)
  2. Aligned data (SAM/BAM/CRAM)
  3. Reference genomes

  1. You should reorder your reads, putting reads from close genome positions near each other. For instance, this alone can let ordinary gzip compress about 3 times better. There are many ways to do this: align the FASTQ to BAM and then export back to FASTQ; use a suffix tree/array to find similar reads, the way most aligners work (needs a lot of memory); or use minimizers - a very fast, low-memory solution, but not good for long reads with many errors. Good results also come from de Bruijn graph construction, which is used for this purpose (aka de-novo assembly). Statistical coding like Huffman or arithmetic coding would compress to about 1/3 (one can pass a Huffman stream to a binary arithmetic coder to gain another 20%).
  2. The best results here come from reference-based compression: just store the differences between the reference and the aligned read.
  3. Little can be done here. Using statistical coding you can get 2-3 bits per nucleotide.
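The reference-based idea in (2) can be sketched very simply: record only the positions where an aligned read differs from the reference window, and reconstruct the read from the reference plus that diff list. This is a toy illustration of the principle, assuming substitutions only (no indels); the names `Diff`, `diff_read`, and `apply_diffs` are made up for the example:

```c
#include <stddef.h>
#include <string.h>

/* One substitution: where the read differs, and what base it has there. */
typedef struct { size_t pos; char base; } Diff;

/* Record only the positions where the read differs from the reference window.
 * Returns the number of diffs written to out. */
static size_t diff_read(const char *ref, const char *read, size_t len,
                        Diff *out) {
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (ref[i] != read[i]) {
            out[n].pos  = i;
            out[n].base = read[i];
            ++n;
        }
    }
    return n;
}

/* Reconstruct the read from the reference plus the diff list. */
static void apply_diffs(const char *ref, size_t len,
                       const Diff *diffs, size_t ndiffs, char *out) {
    memcpy(out, ref, len);
    out[len] = '\0';
    for (size_t i = 0; i < ndiffs; ++i)
        out[diffs[i].pos] = diffs[i].base;
}
```

Since aligned reads typically differ from the reference at only a handful of positions, storing (position, base) pairs is far smaller than storing the read itself; real formats like CRAM build on this idea.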
dim answered Oct 07 '22