Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lists, boxed vectors, and unboxed vectors for heavy scientific computations

Tags:

haskell

I'm told that I should use unboxed vectors for heavy scientific computations (simulations that run for hours and even days) instead of lists, or even boxed vectors.

  1. Is this true?
  2. Are there other data structures than lists, boxed vectors, or unboxed vectors that are common?
  3. Can you explain the difference between boxed and unboxed vectors?
like image 275
hsk6 Avatar asked Jan 09 '16 11:01

hsk6


1 Answers

A couple of things to understand. First, boxed vs unboxed:

  • If you have a "boxed" number, basically you have a pointer to a number. (Or possibly a pointer to an unevaluated expression that will produce a number if it ever gets evaluated.)

  • If you have an "unboxed" number, you have literally the number itself.

For example, on a 64-bit platform, Char is a 64-bit pointer to somewhere on the Haskell heap that holds either a 32-bit Unicode codepoint, or an arbitrarily large unevaluated expression that produces such. On the other hand, Char# is literally just a 32-bit integer. It could be on the heap somewhere, or it could be in a CPU register or something.

Basically Char# is what C would think of as an integer, whereas Char is what C would think of as a pointer to... actually a data structure that tells you if it's evaluated or not, and if it is evaluated, the integer is in there. Which is quite a bit more complicated.

An important thing to notice is that a boxed value is potentially unevaluated; the pointer can point to the result, or the expression for producing the result. An unboxed value can never be unevaluated. For example, a 32-bit integer cannot possibly store an arbitrarily huge expression for calculating an integer; it can only store the integer itself. So boxed vs unboxed is unavoidably tangled up with lazy vs strict.

Lazy evaluation can allow you to avoid calculating result you don't actually need. (Infinite lists and so forth.) But if you actually do always need all the results, strict evaluation is actually faster.

Next, lists vs vectors:

  • A Haskell list [Double] (or whatever) is a singly-linked list of pointers to double-precision floats. Each float could be unevaluated, and each link in the list could also be unevaluated. As you can imagine, this has no kind of cache coherence whatsoever!

  • A vector is an array of stuff. This means no infinite vectors; the size of the vector must be known at creation time. It also means that to modify an immutable vector, you must copy the entire thing, which is very inefficient in time and space. Otherwise you must use mutable vectors, which negates some of the benefits of functional programming. On the other hand, vectors have awesome cache coherence!

Now, a boxed vector is basically an array of pointers to the actual data. An unboxed vector is an array of the actual data. Wanna take a guess which of those has the best cache behaviour? As a side effect, an unboxed vector is also strict, which — if you need the whole vector — is going to be faster.

So you see, unboxed vectors place certain restrictions on you, but potentially give best performance.

Having just said all that, GHC performs all kinds of tricky optimisations that can radically alter the performance of the code versus what it "appears" to be doing. GHC may turn lazy code into strict code, and it may perform "list fusion", where chains of functions that loop over lists get turns into a single tight loop. But then again, chains of vector operations get fused as well, so... in reality, actual performance rather depends on what you're trying to do.

like image 71
MathematicalOrchid Avatar answered Sep 24 '22 13:09

MathematicalOrchid