I'm working on a project in computational biology and I need to store an index of locuses that differ between many sequences. For now, I'm using a B+Tree for that purpose, but I guess using a bitmap index would be way faster for such a use case : only a small number of locus differ between two sequences, 1% on average, and they are nearly equally distributed along the sequence; so it seems like there is a lot of room for bitmap index compression.
My problem is that I cannot manage to find a compression method that can efficiently:
Thx in advance for your suggestions.
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy.
The first, and simplest, bit-pack is to simply adopt a bit-wise format where you have a 1-bit header followed by a known number of bits representing the value. The bit header works as follows: If it is set (1), then the value following it is encoded using 16 bits.
Check out FastBit:
https://sdm.lbl.gov/fastbit/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With