Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why summing native lists is slower than summing church-encoded lists with `GHC -O2`?

In order to test how church-encoded lists perform against user-defiend lists and native lists, I've prepared 3 benchmarks:

User-defined lists

data List a = Cons a (List a) | Nil deriving Show
lenumTil n        = go n Nil where
    go 0 result   = result
    go n result   = go (n-1) (Cons (n-1) result)
lsum Nil          = 0
lsum (Cons h t)   = h + (lsum t)

main = print (lsum (lenumTil (100000000 :: Int)))

Native lists

main = print $ sum ([0..100000000-1] :: [Int])

Church lists

fsum   = (\ a -> (a (+) 0))
fenumTil n cons nil = go n nil where
    go 0 result    = result
    go n result    = go (n-1) (cons (n-1) result)
main = print $ (fsum (fenumTil (100000000 :: Int)) :: Int)

The benchmark results are unexpected:

User-defined lists

-- 4999999950000000
-- real 0m22.520s
-- user 0m59.815s
-- sys  0m20.327s

Native Lists

-- 4999999950000000
-- real 0m0.999s
-- user 0m1.357s
-- sys  0m0.252s

Church Lists

-- 4999999950000000
-- real 0m0.010s
-- user 0m0.002s
-- sys  0m0.003s

One would expect that, with the huge amount of specific optimizations targeted to native lists, they would perform the best. Yet, church lists outperform them by a 100x factor, and outperform user-defined ADTs by a 2250x factor. I've compiled all programs with GHC -O2. I've tried replacing sum by foldl', same result. I've attempted adding user-inputs to make sure the church-list version wasn't optimized to a constant. arkeet pointed out on #haskell that, by inspecting Core, the native version has an intermediate lists, but why? Forcing allocation with an additional reverse, all 3 perform roughly the same.

like image 946
MaiaVictor Avatar asked Aug 19 '15 05:08

MaiaVictor


1 Answers

GHC 7.10 has call arity analysis, which lets us define foldl in terms of foldr and thus let left folds, including sum, participate in fusion. GHC 7.8 also defines sum with foldl but it can't fuse the lists away. Thus GHC 7.10 performs optimally and identically to the Church version.

The Church version is child's play to optimize in either GHC versions. We just have to inline (+) and 0 into fenumTil, and then we have a patently tail-recursive go which can be readily unboxed and then turned into a loop by the code generator.

The user-defined version is not tail-recursive and it works in linear space, which wrecks performance, of course.

like image 145
András Kovács Avatar answered Oct 07 '22 01:10

András Kovács