Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do functional languages represent algebraic data types in memory?

If you were writing a bioinformatics algorithm in Haskell, you'd probably use an algebraic data type to represent the nucleotides:

data Nucleotide = A | T | C | G

You'd do similarly in Standard ML or OCaml, I assume (I've never really used either).

A value of type Nucleotide can clearly be contained in two bits. However, doing so would cause access times to be slower than if you used one byte per Nucleotide value, as you'd need to select out the two bits of interest using binary operators.

There is therefore an inherent tradeoff that the compiler must make between memory efficiency and computational efficiency when deciding how to represent algebraic data types. Furthermore, the representation of algebraic data types in memory is made more complicated by the fact that the value can be of variable size:

data Maybe a = Just a | Nothing

Clearly, a Maybe a value of the form Just a is logically larger than a value of the form Nothing. In an extreme example like this:

data Hulk a b c d e = Big a b c d e | Little

you definitely wouldn't want to have to store in a Little value null pointers or zero values for the five values contained in Big values. I'm assuming that you'd just use heap-allocated memory of variable size, with a constructor ID at the beginning (for example, 0 for Big and 1 for Little). However, if you wanted to store Hulk values on the stack (a faster representation), you'd have to store the blank memory along with Little values so that all values of the type Hulk are the same size. Another tradeoff.

Simon Marlow answered my general question in regard to GHC in a previous StackOverflow question. However, I have three related questions that remain unanswered:

  • Do Standard ML (SML/NJ and MLton) and OCaml use the same technique?
  • If so, do any less common compilers of these languages (or their siblings) experiment with other techniques?
  • Is there a reasonably easy way (ideally a pragma or option flag) in these languages to use a more memory efficient representation, like the two-bit representation of Nucleotide? Such memory efficiency is necessary for many bioinformatics applications; if each Nucleotide had to be one byte, high-performance bioinformatics algorithms would have to resort to manual bit-fiddling.
like image 711
Mike Avatar asked Jul 18 '14 02:07

Mike


1 Answers

There's no single answer: data types are abstract structures and can be implemented in a variety of ways at the implementor's discretion. In practice considerations such as separate compilation tend to constrain things somewhat.

For the specific case of packing a data type containing only nullary constructors into as few bits as possible, you could proceed by defining functions from data type to small integer and back. An integral type hidden by an abstract type (or in Haskell, newtype) would also be a reasonable choice. Packing and unpacking the small integers into whatever aggregate form you are working with would be your job.

By the way, Real World OCaml has a very nice chapter on the representation of OCaml values (rough summary: not hugely different from GHC for the purposes of this question).

like image 166
gsg Avatar answered Oct 10 '22 22:10

gsg