Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are new vectors created even if the old ones aren't used anymore?

This question is about the Data.Vector package.

Given the fact that I'll never use the old value of a certain cell once the cell is updated. Will the update operation always create a new vector, reflecting the update, or will it be done as an in-place update ?

Note: I know about Data.Vector.Mutable

like image 762
haskelline Avatar asked Jul 12 '11 14:07

haskelline


4 Answers

No, but something even better can happen.

Data.Vector is built using "stream fusion". This means that if the sequence of operations that you are performing to build up and then tear down the vector can be fused away, then the Vector itself will never even be constructed and your code will turn into an optimized loop.

Fusion works by turning code that would build vectors into code that builds up and tears down streams and then puts the streams into a form that the compiler can see to perform optimizations.

So code that looks like

foo :: Int
foo = sum as
   where as, bs, cs, ds, es :: Vector Int
         as = map (*100) bs 
         bs = take 10 cs
         cs = zipWith (+) (generate 1000 id) ds
         ds = cons 1 $ cons 3 $ map (+2) es 
         es = replicate 24000 0

despite appearing to build up and tear down quite a few very large vectors can fuse all the way down to an inner loop that only calculates and adds 10 numbers.

Doing what you proposed is tricky, because it requires that you know that no references to a term exist anywhere else, which imposes a cost on any attempt to copy a reference into an environment. Moreover, it interacts rather poorly with laziness. You need to attach little affine addenda to the thunk you conspicuously didn't evaluate yet. But to do this in a multithreaded environment is scarily race prone and hard to get right.

like image 108
Edward Kmett Avatar answered Oct 16 '22 14:10

Edward Kmett


Well, how exactly should the compiler see that "the old vector is not used anywhere"? Say we have a function that changes a vector:

changeIt :: Vector Int -> Int -> Vector Int
changeIt vec n = vec // [(0,n)]

Just from this definition, the compiler cannot assume that vec represents the only reference to the vector in question. We would have to annotate the function so it can only be used in this way - which Haskell doesn't support (but Clean does, as far as I know).

So what can we do in Haskell? Let us say we have another silly function:

changeItTwice vec n = changeIt (changeIt vec n) (n+1)

Now GHC could inline changeIt, and indeed "see" that no reference to the intermediate structure escapes. But typically, you would use this information to not produce that intermediate data structure, instead directly generating the end result!

This is a pretty common optimization (for lists, there is fusion, for example) - and I think it plays pretty much exactly the role you have in mind: Limiting the number of times a data structure needs to be copied. Whether or not this approach is more flexible than in-place-updates is up for debate, but you can definitely recover a lot of performance without having to break abstraction by annotating uniqueness properties.

(However, I think that Vector currently does not, in fact, perform this specific optimization. Might need a few more optimizer rules...)

like image 30
Peter Wortmann Avatar answered Oct 16 '22 14:10

Peter Wortmann


IMHO this is certainly impossible as the GHC garbage collector may go havoc if you randomly change an object (even if it is not used anymore). That is because the object may be moved into an older generation and mutation could introduce pointers to a younger generation. If now the younger generation gets garbage collected, the object may move and thus the pointer may become invalid.

AFAIK, all mutable objects in Haskell are located on a special heap that is treated differently by the GC, so that such problems can't occur.

like image 1
fuz Avatar answered Oct 16 '22 14:10

fuz


Not necessarily. Data.Vector uses stream fusion, so depending on your use the vector may not be created at all and the program may compile to an efficient constant space loop.

This mostly applies to operations that transform the entire vector rather than just updating a single cell, though.

like image 1
hammar Avatar answered Oct 16 '22 14:10

hammar