So, I’m having a lot of trouble writing a program to make a covariance matrix from several large integer vectors, stored in separate files. I started by writing
mean xs = realToFrac (sum xs) / realToFrac (length xs)
cov xs ys = mean (zipWith (*) xs ys) - mean xs * mean ys
covmat vectors = [cov xs ys | ys <- vectors, xs <- vectors]
which worked for small input, but you can see how inefficient even just "mean" is: it keeps all of xs in memory while computing the sum, because the list is needed again by "length". That's fixable, like so:
mean xs = realToFrac thisSum / realToFrac thisLength
  where (thisSum, thisLength) = foldl' (\(s, l) y -> (s + y, l + 1)) (0, 0) xs
but then I run into the same problem with “cov”. When I rewrote "cov" in this style, it didn't end up using my "mean" function. And I still have one level to go up when I write the "covmat" function, which will become extremely complicated.
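Roughly, that single-pass rewrite has to look something like this (just a sketch):

import Data.List (foldl')

-- One pass over the zipped input, accumulating every running sum cov needs.
-- "mean" is nowhere to be seen, and the lazy tuple accumulator brings its own
-- space problems on top of that.
cov xs ys = realToFrac sxy / n - (realToFrac sx / n) * (realToFrac sy / n)
  where
    (sx, sy, sxy, count) =
      foldl'
        (\(ax, ay, axy, l) (x, y) -> (ax + x, ay + y, axy + x * y, l + 1))
        (0, 0, 0, 0 :: Int)
        (zip xs ys)
    n = fromIntegral count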
So, I have two goals, which seem to be in conflict:
Traverse each list once, without keeping it in memory
Break down "covmat" into simpler, meaningful functions, specifically "cov" and "mean"
I don't see any way to unify these two goals with what I know of Haskell. But conceptually it seems simple: all of these functions need to "listen" to the values of the same few lists as they come in. Is there any way in Haskell to organize it like this? If this requires a different data structure or an additional library, I'm open to that.
So, I did some digging around, and I came up with the following.
EDIT: Gist for people who'd prefer that to the SO formatting.
First, a few implementations of mean
module Means where
import Data.List
import Control.Applicative
mean :: (Fractional a1, Real a, Foldable t) => t a -> a1
mean xs = realToFrac (sum xs) / realToFrac (length xs)
mean' :: (Foldable f, Fractional a) => f a -> a
mean' = liftA2 (/) sum (fromIntegral . length)
data Pair = Pair {-# UNPACK #-}!Int {-# UNPACK #-}!Double
mean'' :: [Double] -> Double
mean'' xs = s / fromIntegral n
  where
    Pair n s = foldl' k (Pair 0 0) xs
    k (Pair n s) x = Pair (n+1) (s+x)
The last one uses an explicit strict pair constructor. IIRC, (,) is lazy, so this should give us better performance characteristics.
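(A side note: if you'd rather not define a one-off data type, bang patterns give you the same strictness. This is just a sketch, not one of the benchmarked variants.)

{-# LANGUAGE BangPatterns #-}

-- Sketch only: same accumulator as mean'', but the strictness comes from bang
-- patterns on the arguments instead of a strict constructor.
meanBang :: [Double] -> Double
meanBang = go 0 0
  where
    go :: Int -> Double -> [Double] -> Double
    go !n !s []       = s / fromIntegral n
    go !n !s (x : xs) = go (n + 1) (s + x) xs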
module Covariance where
import Means
covariance :: (Fractional a, Real a1) => [a1] -> [a1] -> a
covariance xs ys = mean (zipWith (*) xs ys) - mean xs * mean ys
covariance' :: Fractional a => [a] -> [a] -> a
covariance' xs ys = mean' (zipWith (*) xs ys) - mean' xs * mean' ys
covariance'' :: [Double] -> [Double] -> Double
covariance'' xs ys = mean'' (zipWith (*) xs ys) - mean'' xs * mean'' ys
covariance''' :: [Double] -> [Double] -> Double
covariance''' xs ys =
  let mx = mean'' xs
      my = mean'' ys
  in
    sum (zipWith (\x y -> (x - mx) * (y - my)) xs ys) / fromIntegral (length xs)
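A quick aside on that last version: it computes the same quantity through the algebraically equivalent centered form

cov(X, Y) = E[XY] - E[X] * E[Y] = E[(X - E[X]) * (Y - E[Y])]

so the small disagreement you'll see in the printed results below is just floating-point rounding; with means this large, the centered form is the numerically better-behaved of the two.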
I tried a few versions of your cov using each of the different mean options, then one uglier "performance" version. I threw together a simple Main with some hardcoded lists for testing.
module Main where
import Means
import Covariance
v1 = [1000000..2000000]
v2 = [2000000..3000000]
main :: IO ()
main = do
  -- let cov = covariance v1 v2
  -- let cov = covariance' v1 v2
  -- let cov = covariance'' v1 v2
  let cov = covariance''' v1 v2
  print cov
Compiling with -rtsopts and running with +RTS -s, I got the following allocation information.
covariance:
8.33335e10
531,816,984 bytes allocated in the heap
890,566,720 bytes copied during GC
148,609,912 bytes maximum residency (11 sample(s))
15,649,528 bytes maximum slop
374 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 981 colls, 0 par 0.385s 0.389s 0.0004s 0.0012s
Gen 1 11 colls, 0 par 0.445s 0.584s 0.0531s 0.2084s
INIT time 0.000s ( 0.002s elapsed)
MUT time 0.194s ( 0.168s elapsed)
GC time 0.830s ( 0.973s elapsed)
EXIT time 0.001s ( 0.029s elapsed)
Total time 1.027s ( 1.172s elapsed)
%GC time 80.9% (83.0% elapsed)
Alloc rate 2,741,140,975 bytes per MUT second
Productivity 19.1% of total user, 16.8% of total elapsed
covariance':
8.333350000320508e10
723,822,456 bytes allocated in the heap
891,626,240 bytes copied during GC
185,629,664 bytes maximum residency (11 sample(s))
3,693,296 bytes maximum slop
435 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1372 colls, 0 par 0.388s 0.392s 0.0003s 0.0010s
Gen 1 11 colls, 0 par 0.388s 0.551s 0.0501s 0.1961s
INIT time 0.000s ( 0.002s elapsed)
MUT time 0.227s ( 0.202s elapsed)
GC time 0.777s ( 0.943s elapsed)
EXIT time 0.001s ( 0.029s elapsed)
Total time 1.006s ( 1.176s elapsed)
%GC time 77.2% (80.2% elapsed)
Alloc rate 3,195,430,190 bytes per MUT second
Productivity 22.8% of total user, 19.6% of total elapsed
covariance'':
8.333350000320508e10
456,108,392 bytes allocated in the heap
394,432,096 bytes copied during GC
79,295,648 bytes maximum residency (15 sample(s))
21,161,776 bytes maximum slop
201 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 861 colls, 0 par 0.085s 0.089s 0.0001s 0.0005s
Gen 1 15 colls, 0 par 0.196s 0.274s 0.0182s 0.0681s
INIT time 0.000s ( 0.002s elapsed)
MUT time 0.124s ( 0.106s elapsed)
GC time 0.282s ( 0.362s elapsed)
EXIT time 0.001s ( 0.021s elapsed)
Total time 0.408s ( 0.491s elapsed)
%GC time 69.1% (73.7% elapsed)
Alloc rate 3,681,440,521 bytes per MUT second
Productivity 30.9% of total user, 25.9% of total elapsed
covariance''':
8.333349999886264e10
336,108,336 bytes allocated in the heap
202,943,312 bytes copied during GC
47,493,864 bytes maximum residency (10 sample(s))
13,578,520 bytes maximum slop
115 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 633 colls, 0 par 0.053s 0.055s 0.0001s 0.0002s
Gen 1 10 colls, 0 par 0.089s 0.131s 0.0131s 0.0472s
INIT time 0.000s ( 0.002s elapsed)
MUT time 0.095s ( 0.086s elapsed)
GC time 0.142s ( 0.186s elapsed)
EXIT time 0.001s ( 0.011s elapsed)
Total time 0.240s ( 0.286s elapsed)
%GC time 59.2% (65.1% elapsed)
Alloc rate 3,522,631,228 bytes per MUT second
Productivity 40.8% of total user, 34.1% of total elapsed
As you can see, a lot of the allocation is dependent on the mean we go with. We get the biggest boost by using mean'' with the strict pair constructor, even with the naive zipWith implementation.
I'm working on wiring up the implementations with weigh, so I may have some more data in a bit.
Beyond tuning the component functions, I don't know of a much more performant way of dealing with covmat, but the strict pair constructor at least should improve your space characteristics regardless of what else you do.
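For what it's worth, the covmat wiring itself doesn't have to change: it's the same comprehension as in the question, just with the tuned element function plugged in (and since the matrix is symmetric, you could compute only one triangle to halve the work).

-- Sketch: same flat, row-major comprehension as the original covmat, with the
-- tuned covariance''' as the element function.
covmat :: [[Double]] -> [Double]
covmat vectors = [covariance''' xs ys | ys <- vectors, xs <- vectors]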
EDIT: weigh results

Case                                  Allocated    GCs
naive mean                          723,716,168  1,382
applicative mean                    723,714,736  1,382
optimized mean, naive zipWith       456,000,688    875
optimized mean, hand-tuned zipWith  336,000,672    642
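(The benchmark declarations themselves aren't shown above; they look something like this, assuming weigh's mainWith/func entry points and reusing the hardcoded inputs. WeighMain is just a hypothetical module name.)

module WeighMain where

import Weigh (mainWith, func)
import Covariance

-- One weigh entry per covariance variant, with the same inputs as Main.
main :: IO ()
main =
  mainWith $ do
    func "naive mean"                         (covariance v1 :: [Double] -> Double) v2
    func "applicative mean"                   (covariance' v1)  v2
    func "optimized mean, naive zipWith"      (covariance'' v1) v2
    func "optimized mean, hand-tuned zipWith" (covariance''' v1) v2
  where
    v1, v2 :: [Double]
    v1 = [1000000 .. 2000000]
    v2 = [2000000 .. 3000000]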
Second EDIT: I grabbed Gabriel's awesome foldl to see what sort of performance we can get without having to hand-tune mean with the explicit strict pair.
import qualified Control.Foldl as L
mean''' :: [Double] -> Double
mean''' = L.fold (liftA2 (/) L.sum L.genericLength)
covariance'''' :: [Double] -> [Double] -> Double
covariance'''' xs ys = mean''' (zipWith (*) xs ys) - mean''' xs * mean''' ys
covariance''''' :: [Double] -> [Double] -> Double
covariance''''' xs ys =
  let mx = mean''' xs
      my = mean''' ys
  in
    mean''' (zipWith (\x y -> (x - mx) * (y - my)) xs ys)
Allocation results:
covariance'''':
8.333350000320508e10
336,108,272 bytes allocated in the heap
222,635,752 bytes copied during GC
61,198,528 bytes maximum residency (10 sample(s))
13,627,544 bytes maximum slop
140 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 633 colls, 0 par 0.052s 0.054s 0.0001s 0.0003s
Gen 1 10 colls, 0 par 0.105s 0.155s 0.0155s 0.0592s
INIT time 0.000s ( 0.002s elapsed)
MUT time 0.110s ( 0.099s elapsed)
GC time 0.156s ( 0.209s elapsed)
EXIT time 0.001s ( 0.014s elapsed)
Total time 0.269s ( 0.323s elapsed)
%GC time 58.1% (64.5% elapsed)
Alloc rate 3,054,641,122 bytes per MUT second
Productivity 41.8% of total user, 34.9% of total elapsed
covariance''''':
8.333349999886264e10
336,108,232 bytes allocated in the heap
202,942,400 bytes copied during GC
47,493,816 bytes maximum residency (10 sample(s))
13,578,568 bytes maximum slop
115 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 633 colls, 0 par 0.057s 0.059s 0.0001s 0.0003s
Gen 1 10 colls, 0 par 0.086s 0.126s 0.0126s 0.0426s
INIT time 0.000s ( 0.002s elapsed)
MUT time 0.096s ( 0.087s elapsed)
GC time 0.143s ( 0.184s elapsed)
EXIT time 0.001s ( 0.011s elapsed)
Total time 0.241s ( 0.285s elapsed)
%GC time 59.2% (64.7% elapsed)
Alloc rate 3,504,449,342 bytes per MUT second
Productivity 40.8% of total user, 34.5% of total elapsed
And weigh results:

foldl mean                 336,000,568  642
foldl mean, tuned zipWith  336,000,568  642
In summary, it looks like the foldl implementation is your best bet. It's extremely clear about what it's doing, and pulls some really fancy tricks to stream inputs efficiently, meeting or exceeding the result of our hand-tuning. You could probably eke some extra juice out of all this using another data structure, but this is pretty good performance for the humble list. :D
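Since Fold composes via Applicative, you could even go one step further and get the single-pass "listening" behaviour the question asks for: build a covariance Fold out of smaller folds and run it once over the zipped input. A rough sketch (premap and mean both live in Control.Foldl):

import qualified Control.Foldl as L

-- Rough sketch: E[XY] - E[X]*E[Y] assembled from three smaller folds. The
-- Applicative instance fuses them, so the zipped input is traversed only once.
covFold :: L.Fold (Double, Double) Double
covFold =
  (\mxy mx my -> mxy - mx * my)
    <$> L.premap (uncurry (*)) L.mean
    <*> L.premap fst L.mean
    <*> L.premap snd L.mean

covarianceOnePass :: [Double] -> [Double] -> Double
covarianceOnePass xs ys = L.fold covFold (zip xs ys)

The same trick extends to covmat: each pair of vectors gets traversed exactly once.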
Third edit: I've never used Edward's folds before, so I may be doing something very stupid, but I also tried out an implementation using those.
import qualified Data.Fold as F
sumL :: Num a => F.L a a
sumL = F.L id (+) 0
lengthL :: Num b => F.L a b
lengthL = F.L id (const . (+1)) 0
mean'''' :: [Double] -> Double
mean'''' = flip F.run (liftA2 (/) sumL lengthL)
covariance'''''' :: [Double] -> [Double] -> Double
covariance'''''' xs ys = mean'''' (zipWith (*) xs ys) - mean'''' xs * mean'''' ys
covariance''''''' :: [Double] -> [Double] -> Double
covariance''''''' xs ys =
  let mx = mean'''' xs
      my = mean'''' ys
  in
    mean'''' (zipWith (\x y -> (x - mx) * (y - my)) xs ys)
Allocation results:
covariance'''''':
8.333350000320508e10
456,108,488 bytes allocated in the heap
394,432,096 bytes copied during GC
79,295,648 bytes maximum residency (15 sample(s))
21,161,776 bytes maximum slop
201 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 861 colls, 0 par 0.089s 0.092s 0.0001s 0.0003s
Gen 1 15 colls, 0 par 0.198s 0.276s 0.0184s 0.0720s
INIT time 0.000s ( 0.002s elapsed)
MUT time 0.135s ( 0.119s elapsed)
GC time 0.287s ( 0.367s elapsed)
EXIT time 0.001s ( 0.019s elapsed)
Total time 0.425s ( 0.506s elapsed)
%GC time 67.6% (72.5% elapsed)
Alloc rate 3,388,218,993 bytes per MUT second
Productivity 32.3% of total user, 27.1% of total elapsed
covariance''''''':
8.333349999886264e10
456,108,552 bytes allocated in the heap
291,275,200 bytes copied during GC
62,670,040 bytes maximum residency (11 sample(s))
15,029,432 bytes maximum slop
172 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 862 colls, 0 par 0.068s 0.070s 0.0001s 0.0003s
Gen 1 11 colls, 0 par 0.149s 0.210s 0.0191s 0.0570s
INIT time 0.000s ( 0.002s elapsed)
MUT time 0.118s ( 0.104s elapsed)
GC time 0.217s ( 0.280s elapsed)
EXIT time 0.001s ( 0.016s elapsed)
Total time 0.337s ( 0.403s elapsed)
%GC time 64.3% (69.6% elapsed)
Alloc rate 3,870,870,585 bytes per MUT second
Productivity 35.7% of total user, 29.9% of total elapsed
And weigh results:

folds mean                 456,000,784  875
folds mean, tuned zipWith  456,000,888  871
Another EDIT: I also tried the folds option using L' rather than L, but the results were the same.
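(For reference, the L' variants are a mechanical swap; this sketch assumes L' takes its arguments in the same order as L.)

-- Sketch: strict left-fold variants, assuming the same (done, step, init)
-- constructor shape as F.L.
sumL' :: Num a => F.L' a a
sumL' = F.L' id (+) 0

lengthL' :: Num b => F.L' a b
lengthL' = F.L' id (const . (+1)) 0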