Haskell Performance by Example

Question

I have these basic types in my code:

newtype Foo (m :: Factored) = Foo Int64 deriving (NFData)

foo :: forall m . (Fact m) => Foo m -> Foo m

class T t where t :: (Fact m ) => t m -> t m

instance T Foo where t = foo

newtype Bar t (m :: Factored) = Bar (t m) deriving (NFData)

bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v

(ignore Fact and Factored for the moment). I'm benchmarking the code at different levels, comparing the performance of foo, t, and bar. In the benchmarks, t = foo, and bar just applies t through a newtype. Thus their runtime should be essentially identical, but criterion reports that foo takes 9.2ns, t takes about twice that at 17.45ns, and bar takes a whopping 268.1ns.

I've experimented with adding INLINABLE and even a SPECIALIZE pragma, but they aren't helping. I want to believe that GHC has some magic syntax/optimization/etc that can be consistently applied to solve these types of performance issues. For example, I've seen cases where writing code in point-free style has dramatic performance implications.

The complete code can be found here. I promise it's not intimidating. The modules are:

Foo: defines Foo, foo, and T
Bar: defines Bar and bar
FooBench: defines a benchmark for foo
TBench: defines a benchmark for t
BarBench: defines a benchmark for bar
Main: runs the three benchmarks
Factored: defines Fact and Factored using singletons

Most of the modules are tiny; I defined the three benchmarks in separate files so that I could examine their core. I generated the core for the three *Bench modules and aligned them the best I could. They're only ~250 lines each, and the first ~200 lines are identical, up to renaming. The problem is that I don't know what to make of the last 50 or so lines. The (core) diff for FooBench vs TBench is here, the diff for TBench vs BarBench is here, and the diff for FooBench vs BarBench is here.

Just a few of the questions I have:

At a high level, what is the essential difference between the core files? I'm looking for something like "Here you can see that GHC isn't inlining the call to x." What should I be looking for?
What can be done to make the three benchmarks all run in about 9.2ns? GHC optimizations? INLINE/INLINABLE pragmas? SPECIALIZE pragmas that I missed? (You aren't allowed to specialized for F128::Factored; in my real library this value may be reified at runtime.) Restricting/delaying inlining to a particular phase?

Although I'm looking for an actual solution to make the benchmarks all fast, it's possible that tricks that work for this example won't scale to my real library. As a result, I'm also looking for the "high-level" explanation of why a particular technique should work.

cchalmers · Accepted Answer

First, looking at bar:

bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v

we can write this without needing the argument using coerce:

bar :: (Fact m, T t) => Bar t m -> Bar t m
bar = (coerce :: (t m -> t m) -> Bar t m -> Bar t m) t

this (as we'd hope) gets bar performing the same as t. (In fact the core for TBench and BarBench is exactly the same, excluding type signatures).

I'm not entirely sure why, but using INLINE instead of INLINEABLE makes t and bar perform the same as foo. I'm not an expert but it's normally better to use INLINE for small functions that you're sure you want to inline.

That said I think some of these issues are from how criterion does benchmarks to stop ghc cheating. For instance, writing bench "Bar" $ nf (GHC.Magic.inline bar) x on your original code has bar performing the same as foo. I suspect a "real" program wouldn't be so delicate.

Haskell Performance by Example

Tags:

performance

haskell

ghc

crockeea

1 Answers

cchalmers

Recent Activity

Donate For Us

Haskell Performance by Example

Tags:

performance

haskell

ghc

crockeea

1 Answers

cchalmers

Related questions

Recent Activity

Donate For Us