Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell pointless performance - efficiently map multiple functions to the same data

I often have to map multiple functions to the same data. I have implemented dpMap to do this for me

dpMap fns = (`map` fns) . flip ($)

dpMap is one function, does this mean I read the data dt just once (like a circuit fed with just the same input. A pointless system reminds of a circuit; just the plumbing no data)?

As an example consider calculating the minimum and maximum of a list dt.

minimax dt = (dpMap [minimum, maximum]) dt

(I could get rid of the dt but have to use -XNoMonomorphismRestriction)

Is there a performance advantage over implementing the same function in a point-full form like this?:

minimax2 dt = [minimum dt, maximum dt]

EDIT: Is there a general implementation of dpMap which works with constant memory?

I found another nice blog post: http://www.haskellforall.com/2013/08/composable-streaming-folds.html ;hope this helps.

EDIT2: After some more context, here is a solution, even though I don't have an exact implementation of dpMap, the pattern is simple enough that it doesn't warrant a separate function:

minimax = (,) <$> minimum <*> maximum

Usage:

> minimax [1..100]
(1,100)

If you want to also calculate the sum and the length

func = (,,,) <$> minimum <*> maximum <*> sum <*> length

Usage:

> func [1..100]
(1,100,5050,100)

like image 734
GeneralBecos Avatar asked Apr 29 '13 16:04

GeneralBecos


1 Answers

TL;DR: There are no guarantees about performance in the language itself. None whatsoever. It is a compiler thing.

As the rule of thumb, a named entity will be memory resident. If it is accessed lazily by only one consumer, it is reasonable to expect it to be optimized such that the compiled program will run in constant memory.

The creation and consumption of memory cells will be interleaved, and each cell will be GC-ed after it was processed.


In minimax2 dt = [minimum dt, maximum dt], the expression [minimum dt, maximum dt] is inside the scope where the named entity dt is defined. Most probably (i.e. almost certain) that GHC will allocate it as a memory entity, i.e. once, and both dt inside the expression will refer to the same entity (point to it, as if pointers).

But as Cat Plus Plus points out in the comments, of course how is entity is accessed is quite another matter. And the two sub-expressions will each access it separately, i.e. it will be retained in memory in full. That's no good.

We can do better, and find our answer by accessing it only once, with a fold, collecting the two pieces of data as we go along. In such situation, it is almost certain that GHC will perform an optimization where this list will not be retained in memory as a whole.

This is what is usually refered to as the list being consumed lazily. When that's the case, its creation will be interleaved with that one access, and each memory cell produced will be immediately consumed and released, by GC (garbage collection), so that constant memory operation will be achieved.

But that's predicated on our ability to scan through the list only once:

{-# OPTIONS_GHC -O2 -XBangPatterns #-}

import Data.List (foldl')

minmax :: (Ord b) => [b] -> (b, b)
minmax (x:xs) = foldl' (\(!a,!b) x -> (min a x,max b x)) (x,x) xs

Bang patterns prevent thunk build-up, making evaluation of arguments more eager. Testing:

Prelude> minmax [1..6]
(1,6)
Prelude> minmax []
*** Exception: <interactive>:1:4-65: Non-exhaustive patterns in function minmax

An empty list of course has no minimum nor maximum defined.

For the optimizations to kick in, -O2 flag must be used when compiling with GHC.

like image 187
Will Ness Avatar answered Oct 18 '22 14:10

Will Ness