Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cartesian List Product in Haskell (Memory and Speed)

I'm trying to write a generic function cart :: [[a]] -> [[a]] for Cartesian product in order to generate the set of 9-tuples of digits from 0 to 7 (lists of 9 elements instead of actual 9-tuples). I've written several syntactically similar functions, but their performances differ drastically.

cart :: [[a]] -> [[a]]
cart [] = [[]]
cart (xs:xss) = [x:ys | x <- xs, ys <- cart xss]

cart' :: [[a]] -> [[a]]    -- Reverse binding
cart' [] = [[]]
cart' (xs:xss) = [x:ys | ys <- cart' xss, x <- xs]

xs = [0..7]

length $ cart $ replicate 9 xs    -- This is slow (4.1s) and memory hungry (1916MB total); ~75% garbage collection time
length $ sequence $ replicate 9 xs    -- Treating list as a monad; this is even slower (12s) and worse on memory (4214MB); ~75% garbage collection time
length $ cart' $ replicate 9 xs    -- This is slightly faster (3.4s), and uses very little memory (2MB); ~1.5% garbage collection time
length [[a,b,c,d,e,f,g,h,i] | a <- xs, b <- xs, c <- xs, d <- xs, e <- xs, f <- xs, g <- xs, h <- xs, i <- xs]    -- This is ultra-fast (0.5s) and uses virtually no memory (1MB); 0% garbage collection time

Is there a way to write a universal Cartesian product function that boasts around the same speed and memory efficiency as the fourth implementation? More importantly, why do these implementations differ so greatly?

Similar thread

like image 466
Feryll Avatar asked Oct 31 '22 02:10

Feryll


1 Answers

With definition like:

xs :: [Int]
xs = [0..7]

prodCart :: [[a]] -> [[a]]
prodCart [] = [[]]
prodCart (xs:xss) = concatMap (\xs' -> map (:xs') xs) (prodCart xss)

main :: IO ()
main = print $ length $
    prodCart $ replicate 9 xs

my results are

134217728
---
  13,345,129,576 bytes allocated in the heap
      13,825,832 bytes copied during GC
          44,312 bytes maximum residency (4 sample(s))
          21,224 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

./cart +RTS -s  1.69s user 0.06s system 98% cpu 1.773 total

Key insights is that as you only need length, the maximum residency should be small if cart is productive, i.e. it can generate elements of result list one by one. The definition you wrote open yourself (the last one) is such (same 44312 residency, 3.913 total time).

In cart definition the prodCart xss sublist is large, and it's persisted in memory with definition you have. It's better to prevent sharing in this case.


If we throw reverse into the place, which will prevent productive calculation of length: then the numbers will change: the order would be prodCart, cart, hand unrolled. Try yourself!

main :: IO ()
main = print $ length $ reverse $
    prodCart $ replicate 8 xs

has

16777216
---
   2,070,840,336 bytes allocated in the heap
   3,175,246,728 bytes copied during GC
     716,199,432 bytes maximum residency (12 sample(s))
      13,690,376 bytes maximum slop
            1411 MB total memory in use (0 MB lost due to fragmentation)

./cart +RTS -s  2.88s user 0.53s system 98% cpu 3.472 total

on my machine.


EDIT I put together a criterion script (which you should always do when microbenchmarking something)

{-# LANGUAGE ScopedTypeVariables #-}
-- stack --resolver=lts-6.0 install criterion
-- stack --resolver=lts-6.0 ghc -- -Wall -O2 cart.hs
import Criterion.Main

-- import Data.Traversable (sequenceA)

cart :: [[a]] -> [[a]]
cart [] = [[]]
cart (xs:xss) = [x:ys | x <- xs, ys <- cart xss]

cart' :: [[a]] -> [[a]]    -- Reverse binding
cart' [] = [[]]
cart' (xs:xss) = [x:ys | ys <- cart' xss, x <- xs]

prodCart :: [[a]] -> [[a]]
prodCart [] = [[]]
prodCart (xs:xss) = concatMap (\xs' -> map (:xs') xs) (prodCart xss)

prodCartRepl :: forall a. [a] -> Int -> [[a]]
prodCartRepl xs  = go
  where
    go :: Int -> [[a]]
    go 0 = [[]]
    go n = concatMap (\xs' -> map (:xs') xs) (go (n - 1))

handrolled :: [a] -> [[a]]
handrolled xs = [[a,b,c,d,e] | a <- xs, b <- xs, c <- xs, d <- xs, e <- xs ]

digits :: [Int]
digits = [0..7]

main :: IO ()
main = defaultMain
    [ bgroup "length"
        [ bench "cart"         $ whnf (length . cart) (replicate 5 digits)
        , bench "cart'"        $ whnf (length . cart') (replicate 5 digits)
        , bench "sequence"     $ whnf (length . sequence) (replicate 5 digits)
        , bench "sequenceA"    $ whnf (length . sequenceA) (replicate 5 digits)
        , bench "prodCart"     $ whnf (length . prodCart) (replicate 5 digits)
        -- Obviously different!
        , bench "prodCartRepl" $ whnf (length . flip prodCartRepl 5) digits 
        , bench "handrolled"   $ whnf (length . handrolled) digits
        ]
      , bgroup "all"
        [ bench "cart"         $ nf cart (replicate 5 digits)
        , bench "cart'"        $ nf cart' (replicate 5 digits)
        , bench "sequence"     $ nf sequence (replicate 5 digits)
        , bench "sequenceA"    $ nf sequenceA (replicate 5 digits)
        , bench "prodCart"     $ nf prodCart (replicate 5 digits)
        -- Obviously different!
        , bench "prodCartRepl" $ nf (flip prodCartRepl 5) digits
        , bench "handrolled"   $ nf handrolled digits
        ]
    ]

Interestingely, when I had "all" section commented out, then the results for handrolled would be 4 times better than "prodCart:

benchmarking length/handrolled
time                 141.8 μs   (140.5 μs .. 143.0 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 141.1 μs   (140.1 μs .. 142.0 μs)
std dev              3.203 μs   (2.657 μs .. 4.091 μs)
variance introduced by outliers: 17% (moderately inflated)

Yet, if I add module Main (main, handrolled, prodCart) where header, then they go worse (2.8s); and again if I {-# INLINE handrolled #-} it goes better down to 140μs again. Other definitions cannot be inlined as they are recursive, and without knowing the recursion depth the definition cannot be inlined enough times to "unroll" the loop. It seems that when handrolled is inlined and composed with length it just counts the items (length fuses with list production), so is blazingly fast. It doesn't happen with other versions.

Checking the Core (-ddump-simpl) should reveal that, but I didn't check.

The whole resutls are:

benchmarking length/cart
time                 885.1 μs   (844.2 μs .. 934.4 μs)
                     0.978 R²   (0.962 R² .. 0.991 R²)
mean                 850.2 μs   (828.3 μs .. 881.0 μs)
std dev              89.79 μs   (65.94 μs .. 128.2 μs)
variance introduced by outliers: 76% (severely inflated)

benchmarking length/cart'
time                 429.1 μs   (417.7 μs .. 441.7 μs)
                     0.995 R²   (0.992 R² .. 0.998 R²)
mean                 437.2 μs   (430.3 μs .. 447.3 μs)
std dev              26.67 μs   (21.33 μs .. 33.43 μs)
variance introduced by outliers: 55% (severely inflated)

benchmarking length/sequence
time                 1.006 ms   (970.5 μs .. 1.057 ms)
                     0.971 R²   (0.948 R² .. 0.988 R²)
mean                 1.115 ms   (1.075 ms .. 1.186 ms)
std dev              166.2 μs   (120.7 μs .. 228.5 μs)
variance introduced by outliers: 86% (severely inflated)

benchmarking length/sequenceA
time                 1.008 ms   (977.5 μs .. 1.041 ms)
                     0.990 R²   (0.982 R² .. 0.995 R²)
mean                 1.050 ms   (1.027 ms .. 1.080 ms)
std dev              90.05 μs   (70.35 μs .. 114.8 μs)
variance introduced by outliers: 66% (severely inflated)

benchmarking length/prodCart
time                 435.7 μs   (426.7 μs .. 445.2 μs)
                     0.996 R²   (0.993 R² .. 0.998 R²)
mean                 435.6 μs   (429.1 μs .. 443.3 μs)
std dev              23.63 μs   (19.21 μs .. 29.16 μs)
variance introduced by outliers: 49% (moderately inflated)

benchmarking length/prodCartRepl
time                 454.7 μs   (424.3 μs .. 502.9 μs)
                     0.968 R²   (0.947 R² .. 0.994 R²)
mean                 448.6 μs   (435.2 μs .. 466.2 μs)
std dev              51.97 μs   (37.25 μs .. 71.59 μs)
variance introduced by outliers: 82% (severely inflated)

benchmarking length/handrolled
time                 142.8 μs   (141.0 μs .. 145.8 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 143.8 μs   (142.6 μs .. 145.8 μs)
std dev              5.080 μs   (3.776 μs .. 7.583 μs)
variance introduced by outliers: 33% (moderately inflated)

benchmarking all/cart
time                 2.123 ms   (2.050 ms .. 2.212 ms)
                     0.977 R²   (0.955 R² .. 0.993 R²)
mean                 2.035 ms   (1.981 ms .. 2.129 ms)
std dev              227.1 μs   (156.5 μs .. 335.3 μs)
variance introduced by outliers: 72% (severely inflated)

benchmarking all/cart'
time                 1.278 ms   (1.245 ms .. 1.318 ms)
                     0.986 R²   (0.971 R² .. 0.996 R²)
mean                 1.339 ms   (1.301 ms .. 1.393 ms)
std dev              157.7 μs   (105.3 μs .. 218.0 μs)
variance introduced by outliers: 77% (severely inflated)

benchmarking all/sequence
time                 1.772 ms   (1.726 ms .. 1.833 ms)
                     0.989 R²   (0.976 R² .. 0.998 R²)
mean                 1.799 ms   (1.765 ms .. 1.854 ms)
std dev              148.9 μs   (90.60 μs .. 234.2 μs)
variance introduced by outliers: 61% (severely inflated)

benchmarking all/sequenceA
time                 2.058 ms   (1.979 ms .. 2.143 ms)
                     0.988 R²   (0.982 R² .. 0.993 R²)
mean                 1.903 ms   (1.859 ms .. 1.952 ms)
std dev              157.6 μs   (131.6 μs .. 189.2 μs)
variance introduced by outliers: 61% (severely inflated)

benchmarking all/prodCart
time                 1.367 ms   (1.303 ms .. 1.438 ms)
                     0.988 R²   (0.980 R² .. 0.996 R²)
mean                 1.349 ms   (1.324 ms .. 1.396 ms)
std dev              118.0 μs   (74.92 μs .. 198.2 μs)
variance introduced by outliers: 65% (severely inflated)

benchmarking all/prodCartRepl
time                 1.331 ms   (1.294 ms .. 1.381 ms)
                     0.992 R²   (0.988 R² .. 0.997 R²)
mean                 1.350 ms   (1.328 ms .. 1.379 ms)
std dev              84.37 μs   (63.50 μs .. 116.7 μs)
variance introduced by outliers: 49% (moderately inflated)

benchmarking all/handrolled
time                 3.552 ms   (3.455 ms .. 3.711 ms)
                     0.986 R²   (0.972 R² .. 0.996 R²)
mean                 3.631 ms   (3.547 ms .. 3.724 ms)
std dev              281.9 μs   (226.9 μs .. 349.7 μs)
variance introduced by outliers: 51% (severely inflated)

As you see from all benchmarks, handrolled comprehension isn't actually that fast when you actually want the result, not only produce it.

like image 125
phadej Avatar answered Nov 15 '22 08:11

phadej