Why summing native lists is slower than summing church-encoded lists with `GHC -O2`?

Question

In order to test how church-encoded lists perform against user-defiend lists and native lists, I've prepared 3 benchmarks:

User-defined lists

data List a = Cons a (List a) | Nil deriving Show
lenumTil n        = go n Nil where
    go 0 result   = result
    go n result   = go (n-1) (Cons (n-1) result)
lsum Nil          = 0
lsum (Cons h t)   = h + (lsum t)

main = print (lsum (lenumTil (100000000 :: Int)))

Native lists

main = print $ sum ([0..100000000-1] :: [Int])

Church lists

fsum   = (\ a -> (a (+) 0))
fenumTil n cons nil = go n nil where
    go 0 result    = result
    go n result    = go (n-1) (cons (n-1) result)
main = print $ (fsum (fenumTil (100000000 :: Int)) :: Int)

The benchmark results are unexpected:

User-defined lists

-- 4999999950000000
-- real 0m22.520s
-- user 0m59.815s
-- sys  0m20.327s

Native Lists

-- 4999999950000000
-- real 0m0.999s
-- user 0m1.357s
-- sys  0m0.252s

Church Lists

-- 4999999950000000
-- real 0m0.010s
-- user 0m0.002s
-- sys  0m0.003s

One would expect that, with the huge amount of specific optimizations targeted to native lists, they would perform the best. Yet, church lists outperform them by a 100x factor, and outperform user-defined ADTs by a 2250x factor. I've compiled all programs with GHC -O2. I've tried replacing sum by foldl', same result. I've attempted adding user-inputs to make sure the church-list version wasn't optimized to a constant. arkeet pointed out on #haskell that, by inspecting Core, the native version has an intermediate lists, but why? Forcing allocation with an additional reverse, all 3 perform roughly the same.

András Kovács · Accepted Answer

GHC 7.10 has call arity analysis, which lets us define foldl in terms of foldr and thus let left folds, including sum, participate in fusion. GHC 7.8 also defines sum with foldl but it can't fuse the lists away. Thus GHC 7.10 performs optimally and identically to the Church version.

The Church version is child's play to optimize in either GHC versions. We just have to inline (+) and 0 into fenumTil, and then we have a patently tail-recursive go which can be readily unboxed and then turned into a loop by the code generator.

The user-defined version is not tail-recursive and it works in linear space, which wrecks performance, of course.

Why summing native lists is slower than summing church-encoded lists with `GHC -O2`?

Tags:

performance

algorithm

list

haskell

User-defined lists

Native lists

Church lists

User-defined lists

Native Lists

Church Lists

MaiaVictor

1 Answers

András Kovács

Recent Activity

Donate For Us

Why summing native lists is slower than summing church-encoded lists with `GHC -O2`?

Tags:

performance

algorithm

list

haskell

User-defined lists

Native lists

Church lists

User-defined lists

Native Lists

Church Lists

MaiaVictor

1 Answers

András Kovács

Related questions

Recent Activity

Donate For Us