 

How can we compress DNA string efficiently

DNA strings can be of any length and consist of any combination of the five letters A, T, G, C, N.
What would be an efficient way to compress a DNA string over this five-letter alphabet? Instead of using 3 bits per letter, can we compress and retrieve efficiently using fewer bits? Can anybody suggest pseudo code for efficient compression and retrieval?

ksn asked Aug 15 '18

3 Answers

You can, if you are willing to (a) use a different bit length for each character and (b) always read from the start, never from the middle. Then you can use a code such as:

  • A - 00
  • T - 01
  • G - 10
  • C - 110
  • N - 111

Reading from left to right, a stream of bits splits into characters in only one way, because the code is prefix-free. You read 2 bits at a time, and if they are "11" you read one more bit to know which character it is.

This is based on Huffman coding.

Note:
I don't know much about DNA, but if the character probabilities are not equal (i.e., not 20% each), you should assign the shortest codes to the most frequent characters.
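The scheme above can be sketched in C. This is a minimal illustration, not a tuned implementation; the function names (`encode`, `decode`, `code_for`) are made up for this example, and the output buffer is assumed to be zero-initialized before encoding:

```c
#include <stdint.h>
#include <string.h>

/* Variable-length prefix codes: A=00, T=01, G=10, C=110, N=111 */
static void code_for(char c, uint32_t *bits, int *len) {
    switch (c) {
        case 'A': *bits = 0x0; *len = 2; break;
        case 'T': *bits = 0x1; *len = 2; break;
        case 'G': *bits = 0x2; *len = 2; break;
        case 'C': *bits = 0x6; *len = 3; break;
        default:  *bits = 0x7; *len = 3; break; /* 'N' */
    }
}

/* Append each symbol's code, MSB-first, into a zeroed byte buffer.
 * Returns the total number of bits written. */
static int encode(const char *dna, uint8_t *out) {
    int nbits = 0;
    for (; *dna; ++dna) {
        uint32_t bits; int len;
        code_for(*dna, &bits, &len);
        for (int i = len - 1; i >= 0; --i) {
            if ((bits >> i) & 1)
                out[nbits / 8] |= (uint8_t)(0x80 >> (nbits % 8));
            ++nbits;
        }
    }
    return nbits;
}

/* Read 2 bits; if they are "11", read one more bit to tell C from N. */
static void decode(const uint8_t *in, int nbits, char *out) {
    static const char two[] = { 'A', 'T', 'G' };
    int pos = 0;
    while (pos + 1 < nbits) {
        int b0 = (in[pos / 8] >> (7 - pos % 8)) & 1; ++pos;
        int b1 = (in[pos / 8] >> (7 - pos % 8)) & 1; ++pos;
        int v = b0 * 2 + b1;
        if (v < 3) {
            *out++ = two[v];
        } else {
            int b2 = (in[pos / 8] >> (7 - pos % 8)) & 1; ++pos;
            *out++ = b2 ? 'N' : 'C';
        }
    }
    *out = '\0';
}
```

With uniform input this averages 2.4 bits per character (three 2-bit codes, two 3-bit codes), versus 3 bits for a fixed-width code.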

Roee Gavirel answered Oct 07 '22


You have 5 unique values, so you need a base-5 encoding (e.g. A=0, T=1, G=2, C=3, N=4).

In 32 bits you can fit ⌊log5(2^32)⌋ = 13 base-5 values.

In 64 bits you can fit ⌊log5(2^64)⌋ = 27 base-5 values.

The encoding process would be:

uint8_t *input = /* base-5 encoded DNA letters */;
uint64_t packed = 0;
for (int i = 0; i < 27; ++i) {
    packed = packed * 5 + *input++;
}

And decoding:

uint8_t *output = /* allocate buffer */;
uint64_t packed = /* next encoded chunk */;
for (int i = 0; i < 27; ++i) {
    *output++ = packed % 5;  /* note: values come out in reverse order of packing */
    packed /= 5;
}
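Putting the two fragments together, here is a minimal self-contained round trip. The function names (`base5`, `pack27`, `unpack27`) are illustrative; the unpacker fills the output backwards, since repeated division yields the last-packed digit first:

```c
#include <stdint.h>
#include <string.h>

/* Map A,T,G,C,N -> 0..4 */
static int base5(char c) {
    switch (c) {
        case 'A': return 0;
        case 'T': return 1;
        case 'G': return 2;
        case 'C': return 3;
        default:  return 4; /* 'N' */
    }
}

/* Pack up to 27 symbols into one uint64_t (5^27 < 2^64). */
static uint64_t pack27(const char *dna, int n) {
    uint64_t packed = 0;
    for (int i = 0; i < n; ++i)
        packed = packed * 5 + (uint64_t)base5(dna[i]);
    return packed;
}

/* Unpack n symbols; repeated division yields them last-first, so fill backwards. */
static void unpack27(uint64_t packed, int n, char *out) {
    static const char sym[] = "ATGCN";
    out[n] = '\0';
    for (int i = n - 1; i >= 0; --i) {
        out[i] = sym[packed % 5];
        packed /= 5;
    }
}
```

This gives 64/27 ≈ 2.37 bits per letter, close to the theoretical log2(5) ≈ 2.32 bits.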
rustyx answered Oct 07 '22


There are plenty of compression methods, but the main question is what data you want to compress:

  1. Raw unaligned sequenced data from the sequencing machine (FASTQ)
  2. Aligned data (SAM/BAM/CRAM)
  3. Reference genomes

  1. You should reorder your reads, putting reads from close genome positions near each other. For instance, this alone can let ordinary gzip compress about 3 times better. There are many ways to do this: align the FASTQ to BAM and then export back to FASTQ; use a suffix tree/array to find similar reads, the way most aligners work (needs a lot of memory); or use minimizers - a very fast, low-memory solution, but not good for long reads with many errors. Good results also come from de Bruijn graph construction, which is used for this purpose (aka de-novo assembly). Statistical coding like Huffman or arithmetic coding would compress to about 1/3 (one can pass a Huffman stream to a binary arithmetic coder to gain another 20%).
  2. The best results here come from reference-based compression: just store the differences between the reference and the aligned read.
  3. Little can be done here. Using statistical coding you can get 2-3 bits per nucleotide.
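The reference-based idea in (2) can be sketched very simply: record only the positions where an aligned read differs from the reference window, and reconstruct the read from the reference plus that diff list. This is a toy illustration of the principle, assuming substitutions only (no indels); the names `Diff`, `diff_read`, and `apply_diffs` are made up for the example:

```c
#include <stddef.h>
#include <string.h>

/* One substitution: where the read differs, and what base it has there. */
typedef struct { size_t pos; char base; } Diff;

/* Record only the positions where the read differs from the reference window.
 * Returns the number of diffs written to out. */
static size_t diff_read(const char *ref, const char *read, size_t len,
                        Diff *out) {
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (ref[i] != read[i]) {
            out[n].pos  = i;
            out[n].base = read[i];
            ++n;
        }
    }
    return n;
}

/* Reconstruct the read from the reference plus the diff list. */
static void apply_diffs(const char *ref, size_t len,
                       const Diff *diffs, size_t ndiffs, char *out) {
    memcpy(out, ref, len);
    out[len] = '\0';
    for (size_t i = 0; i < ndiffs; ++i)
        out[diffs[i].pos] = diffs[i].base;
}
```

Since aligned reads typically differ from the reference at only a handful of positions, storing (position, base) pairs is far smaller than storing the read itself; real formats like CRAM build on this idea.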
dim answered Oct 07 '22