Inlining derived typeclass methods

Haskell lets you derive typeclass instances, such as:

    {-# LANGUAGE DeriveFunctor #-}

    data Foo a = MakeFoo a a deriving (Functor)

... but sometimes benchmarks show that performance improves if you manually implement the typeclass instance and annotate the type class method(s) with INLINE, like this:

    data Foo a = MakeFoo a a

    instance Functor Foo where
        fmap f (MakeFoo x y) = MakeFoo (f x) (f y)
        {-# INLINE fmap #-}

Is there a way to get the best of both worlds? In other words, is there a way to derive the typeclass instance and also annotate the derived typeclass methods with INLINE?

Asked by Gabriella Gonzalez on Oct 02 '18 at 02:10


1 Answer

Though you cannot "reopen" instances in Haskell like you could with classes in dynamic languages, there are ways to ensure that functions will be aggressively inlined whenever possible by passing certain flags to GHC.

-fspecialise-aggressively removes the restrictions about which functions are specialisable. Any overloaded function will be specialised with this flag. This can potentially create lots of additional code.

-fexpose-all-unfoldings will include the (optimised) unfoldings of all functions in interface files so that they can be inlined and specialised across modules.

Using these two flags in conjunction will have nearly the same effect as marking every definition as INLINABLE apart from the fact that the unfoldings for INLINABLE definitions are not optimised.

(Source: https://wiki.haskell.org/Inlining_and_Specialisation#Which_flags_can_I_use_to_control_the_simplifier_and_inliner.3F)
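
As a minimal sketch of how the two flags might be applied, they can be set for a single module with an OPTIONS_GHC pragma instead of on the command line. The -O2 setting and the helper name bump are assumptions made for illustration; Foo is the type from the question, and its Functor instance is still derived.

```haskell
{-# LANGUAGE DeriveFunctor #-}
-- Assumption: the flags are set per module here; passing them to ghc on the
-- command line (or via ghc-options in a Cabal file) should work equally well.
{-# OPTIONS_GHC -O2 -fexpose-all-unfoldings -fspecialise-aggressively #-}

module Main (main) where

-- Foo is the type from the question; the Functor instance is derived,
-- so there is no hand-written fmap to attach an INLINE pragma to.
data Foo a = MakeFoo a a
  deriving (Show, Eq, Functor)

-- Hypothetical call site: a use of the derived fmap at a concrete type.
bump :: Foo Int -> Foo Int
bump = fmap (+ 1)

main :: IO ()
main = print (bump (MakeFoo 1 2))  -- prints: MakeFoo 2 3
```

In a multi-module project the pragma would go in the module that defines Foo, since that is the module whose interface file needs to carry the unfoldings.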

These options allow GHC to inline fmap. The -fexpose-all-unfoldings option, in particular, puts the unfoldings of every function in the module being compiled (including the derived fmap) into its interface file, so that they are available for inlining elsewhere (and it seems to provide the largest performance benefit). Here's a quick & dumb benchmark I threw together:

functor.hs contains this code:

    {-# LANGUAGE DeriveFunctor #-}
    {-# LANGUAGE Strict #-}

    data Foo a = MakeFoo a a deriving (Functor)

    one_fmap foo = fmap (+1) foo

    main = sequence (fmap (\n -> return $ one_fmap $ MakeFoo n n) [1..10000000])

Compiled with no arguments:

    $ time ./functor

    real    0m4.036s
    user    0m3.550s
    sys     0m0.485s

Compiled with -fexpose-all-unfoldings:

    $ time ./functor

    real    0m3.662s
    user    0m3.258s
    sys     0m0.404s

Here's the .prof file from this build. fmap does not show up as a separate cost centre, which suggests that the call is indeed getting inlined:

        Sun Oct  7 00:06 2018 Time and Allocation Profiling Report  (Final)

           functor +RTS -p -RTS

        total time  =        1.95 secs   (1952 ticks @ 1000 us, 1 processor)
        total alloc = 4,240,039,224 bytes  (excludes profiling overheads)

    COST CENTRE MODULE SRC              %time %alloc

    CAF         Main   <entire-module>  100.0  100.0


                                                             individual      inherited
    COST CENTRE MODULE                SRC             no.     entries  %time %alloc   %time %alloc

    MAIN        MAIN                  <built-in>       44          0    0.0    0.0   100.0  100.0
     CAF        Main                  <entire-module>  87          0  100.0  100.0   100.0  100.0
     CAF        GHC.IO.Handle.FD      <entire-module>  84          0    0.0    0.0     0.0    0.0
     CAF        GHC.IO.Encoding       <entire-module>  77          0    0.0    0.0     0.0    0.0
     CAF        GHC.Conc.Signal       <entire-module>  71          0    0.0    0.0     0.0    0.0
     CAF        GHC.IO.Encoding.Iconv <entire-module>  58          0    0.0    0.0     0.0    0.0

Compiled with -fspecialise-aggressively:

    $ time ./functor

    real    0m3.761s
    user    0m3.300s
    sys     0m0.460s

Compiled with both flags:

    $ time ./functor

    real    0m3.665s
    user    0m3.213s
    sys     0m0.452s

These little benchmarks are by no means representative of what the performance (or file size) will be like in real code, but they do show that you can get GHC to inline fmap (and that doing so can have a non-negligible effect on performance).
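
Timings only hint at what the optimiser did. A more direct check, sketched here under some assumptions, is to ask GHC to print the optimised Core while compiling and to read the definition of the function that calls fmap. The name oneFmap, the -O level, and the NOINLINE pragma are choices made for this illustration and were not part of the benchmark above.

```haskell
{-# LANGUAGE DeriveFunctor #-}
{-# OPTIONS_GHC -O -fexpose-all-unfoldings #-}
-- Print the simplified Core during compilation, with most annotations hidden.
{-# OPTIONS_GHC -ddump-simpl -dsuppress-all -dsuppress-uniques #-}

module Main (main) where

data Foo a = MakeFoo a a
  deriving (Show, Functor)

-- Hypothetical call site to inspect. NOINLINE keeps oneFmap as its own
-- top-level binding so that it is easy to find in the dump; it does not
-- stop fmap from being inlined into oneFmap's body.
oneFmap :: Foo Int -> Foo Int
oneFmap foo = fmap (+ 1) foo
{-# NOINLINE oneFmap #-}

main :: IO ()
main = print (oneFmap (MakeFoo 1 2))  -- prints: MakeFoo 2 3
```

If fmap was inlined, the Core for oneFmap should pattern match on MakeFoo and rebuild it directly, with no call to the derived instance method (GHC usually gives that method a name along the lines of $fFunctorFoo_$cfmap, but treat the exact name as an assumption that may vary between compiler versions).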

Answered by Aearnus on Dec 09 '22 at 05:12