If you were writing a bioinformatics algorithm in Haskell, you'd probably use an algebraic data type to represent the nucleotides:
data Nucleotide = A | T | C | G
You'd do similarly in Standard ML or OCaml, I assume (I've never really used either).
A value of type Nucleotide
can clearly be contained in two bits. However, doing so would cause access times to be slower than if you used one byte per Nucleotide
value, as you'd need to select out the two bits of interest using binary operators.
There is therefore an inherent tradeoff that the compiler must make between memory efficiency and computational efficiency when deciding how to represent algebraic data types. Furthermore, the representation of algebraic data types in memory is made more complicated by the fact that the value can be of variable size:
data Maybe a = Just a | Nothing
Clearly, a Maybe a
value of the form Just a
is logically larger than a value of the form Nothing
. In an extreme example like this:
data Hulk a b c d e = Big a b c d e | Little
you definitely wouldn't want to have to store in a Little
value null pointers or zero values for the five values contained in Big
values. I'm assuming that you'd just use heap-allocated memory of variable size, with a constructor ID at the beginning (for example, 0
for Big
and 1
for Little
). However, if you wanted to store Hulk
values on the stack (a faster representation), you'd have to store the blank memory along with Little
values so that all values of the type Hulk
are the same size. Another tradeoff.
Simon Marlow answered my general question in regard to GHC in a previous StackOverflow question. However, I have three related questions that remain unanswered:
Nucleotide
? Such memory efficiency is necessary for many bioinformatics applications; if each Nucleotide
had to be one byte, high-performance bioinformatics algorithms would have to resort to manual bit-fiddling.There's no single answer: data types are abstract structures and can be implemented in a variety of ways at the implementor's discretion. In practice considerations such as separate compilation tend to constrain things somewhat.
For the specific case of packing a data type containing only nullary constructors into as few bits as possible, you could proceed by defining functions from data type to small integer and back. An integral type hidden by an abstract type (or in Haskell, newtype
) would also be a reasonable choice. Packing and unpacking the small integers into whatever aggregate form you are working with would be your job.
By the way, Real World OCaml has a very nice chapter on the representation of OCaml values (rough summary: not hugely different from GHC for the purposes of this question).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With