Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using GHC's profiling stats/charts to identify trouble-areas / improve performance of Haskell code

TL;DR: Based on the Haskell code and it's associated profiling data below, what conclusions can we draw that let us modify/improve it so we can narrow the performance gap vs. the same algorithm written in imperative languages (namely C++ / Python / C# but the specific language isn't important)?

Background

I wrote the following piece of code as an answer to a question on a popular site which contains many questions of a programming and/or mathematical nature. (You've probably heard of this site, whose name is pronounced "oiler" by some, "yoolurr" by others.) Since the code below is a solution to one of the problems, I'm intentionally avoiding any mention of the site's name or any specific terms in the problem. That said, I'm talking about problem one hundred and three.

(In fact, I've seen many solutions in the site's forums from resident Haskell wizards :P)

Why did I choose to profile this code?

This was the first problem (on said site) in which I encountered a difference in performance (as measured by execution time) between Haskell code vs. C++/Python/C# code (when both use a similar algorithm). In fact, it was the case for all of the problems (thus far; I've done ~100 problems but not sequentially) that an optimized Haskell code was pretty much neck-and-neck with the fastest C++ solutions, ceteris paribus for the algorithm, of course.

However, the posts in the forum for this particular problem would indicate that the same algorithm in these other languages typically require at most one or two seconds, with the longest taking 10-15 sec (assuming the same starting parameters; I'm ignoring the very naive algorithms that take 2-3 min+). In contrast, the Haskell code below required ~50 sec on my (decent) computer (with profiling disabled; with profiling enabled, it takes ~2 min, as you can see below; note: the exec time was identical when compiling with -fllvm). Specs: i5 2.4ghz laptop, 8gb RAM.

In an effort to learn Haskell in a way that it can become a viable substitute to the imperative languages, one of my aims in solving these problems is learning to write code that, to the extent possible, has performance that's on par with those imperative languages. In that context, I still consider the problem as yet unsolved by me (since there's nearly a ~25x difference in performance!)

What have I done so far?

In addition to the obvious step of streamlining the code itself (to the best of my ability), I've also performed the standard profiling exercises that are recommended in "Real World Haskell".

But I'm having a hard time drawing conclusions that that tell me which pieces need to be modified. That's where I'm hoping folks might be able to help provide some guidance.

Description of the problem:

I'd refer you to the website of problem one hundred and three on the aforementioned site but here's a brief summary: the goal is to find a group of seven numbers such that any two disjoint subgroups (of that group) satisfy the following two properties (I'm trying to avoid using the 's-e-t' word for reasons mentioned above...):

  • no two subgroups sum to the same amount
  • the subgroup with more elements has a larger sum (in other words, the sum of the smallest four elements must be greater than the sum of the largest three elements).

In particular, we are trying to find the group of seven numbers with the smallest sum.

My (admittedly weak) observations

A warning: some of these comments may well be totally wrong but I wanted to atleast take a stab at interpreting the profiling data based on what I read in Real World Haskell and other profiling-related posts on SO.

  • There does indeed seem to be an efficiency issue seeing as how one-third of the time is spent doing garbage collection (37.1%). The first table of figures shows that ~172gb is allocated in the heap, which seems horrible... (Maybe there's a better structure / function to use for implementing the dynamic programming solution?)
  • Not surprisingly, the vast majority (83.1%) of time is spent checking rule 1: (i) 41.6% in the value sub-function, which determines values to fill in the dynamic programming ("DP") table, (ii) 29.1% in the table function, which generates the DP table and (iii) 12.4% in the rule1 function, which checks the resulting DP table to make sure that a given sum can only be calculated in one way (i.e., from one subgroup).
  • However, I did find it surprising that more time was spent in the value function relative to the table and rule1 functions given that it's the only one of the three which doesn't construct an array or filter through a large number of elements (it's really only performing O(1) lookups and making comparisons between Int types, which you'd think would be relatively quick). So this is a potential problem area. That said, it's unlikely that the value function is driving the high heap-allocation

Frankly, I'm not sure what to make of the three charts.

Heap profile chart (i.e., the first char below):

  • I'm honestly not sure what is represented by the red area marked as Pinned. It makes sense that the dynamic function has a "spiky" memory allocation because it's called every time the construct function generates a tuple that meets the first three criteria and, each time it's called, it creates a decently large DP array. Also, I'd think that the allocation of memory to store the tuples (generated by construct) wouldn't be flat over the course of the program.
  • Pending clarification of the "Pinned" red area, I'm not sure this one tells us anything useful.

Allocation by type and allocation by constructor:

  • I suspect that the ARR_WORDS (which represents a ByteString or unboxed Array according to the GHC docs) represents the low-level execution of the construction of the DP array (in the table function). Nut I'm not 100% sure.
  • I'm not sure what's the FROZEN and STATIC pointer categories correspond to.
  • Like I said, I'm really not sure how to interpret the charts as nothing jumps out (to me) as unexpected.

The code and the profiling results

Without further ado, here's the code with comments explaining my algorithm. I've tried to make sure the code doesn't run off of the right-side of the code-box - but some of the comments do require scrolling (sorry).

{-# LANGUAGE NoImplicitPrelude #-}
{-# OPTIONS_GHC -Wall #-}

import CorePrelude
import Data.Array
import Data.List
import Data.Bool.HT ((?:))
import Control.Monad (guard)

main = print (minimum construct)

cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int

--we enumerate tuples that are potentially valid and then
--filter for valid ones; we perform the most computationally
--expensive step (i.e., rule 1) at the very end
construct :: [[Int]]
construct = {-# SCC "construct" #-} do
  a <- [flr..cap]                         --1st: we construct potentially valid tuples while applying a
  b <- [a+step..cap]                      --constraint on the upper bound of any element as implied by rule 2
  c <- [b+step..a+b-1]
  d <- [c+step..a+b-1]
  e <- [d+step..a+b-1]
  f <- [e+step..a+b-1]
  g <- [f+step..a+b-1]
  guard (a + b + c + d - e - f - g > 0)   --2nd: we screen for tuples that completely conform to rule 2
  let nn = [g,f,e,d,c,b,a]
  guard (sum nn < 285)                    --3rd: we screen for tuples of a certain size (a guess to speed things up)
  guard (rule1 nn)                        --4th: we screen for tuples that conform to rule 1
  return nn

rule1 :: [Int] -> Bool
rule1 nn = {-# SCC "rule1" #-} 
    null . filter ((>1) . snd)           --confirm that there's only one subgroup that sums to any given sum
  . filter ((length nn==) . snd . fst)   --the last column us how many subgroups sum to a given sum
  . assocs                               --run the dynamic programming algorithm and generate a table
  $ dynamic nn

dynamic :: [Int] -> Array (Int,Int) Int
dynamic ns = {-# SCC "dynamic" #-} table
  where
    (len, maxSum) = (length &&& sum) ns
    table = array ((0,0),(maxSum,len)) 
      [ ((s,i),x) | s <- [0..maxSum], i <- [0..len], let x = value (s,i) ]
    elements = listArray (0,len) (0:ns)
    value (s,i)
      | i == 0 || s == 0 = 0
      | s ==  m = table ! (s,i-1) + 1
      | s > m = s <= sum (take i ns) ?: 
          (table ! (s,i-1) + table ! ((s-m),i-1), 0)
      | otherwise = 0
      where
        m = elements ! i

Stats on heap allocation, garbage collection and time elapsed:

% ghc -O2 --make 103_specialsubset2.hs -rtsopts -prof -auto-all -caf-all -fforce-recomp
[1 of 1] Compiling Main             ( 103_specialsubset2.hs, 103_specialsubset2.o )
Linking 103_specialsubset2 ...
% time ./103_specialsubset2.hs +RTS -p -sstderr
zsh: permission denied: ./103_specialsubset2.hs
./103_specialsubset2.hs +RTS -p -sstderr  0.00s user 0.00s system 86% cpu 0.002 total
% time ./103_specialsubset2 +RTS -p -sstderr
SOLUTION REDACTED
 172,449,596,840 bytes allocated in the heap
  21,738,677,624 bytes copied during GC
         261,128 bytes maximum residency (74 sample(s))
          55,464 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     327548 colls,     0 par   27.34s   41.64s     0.0001s    0.0092s
  Gen  1        74 colls,     0 par    0.02s    0.02s     0.0003s    0.0013s

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   53.91s  ( 70.60s elapsed)
  GC      time   27.35s  ( 41.66s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   81.26s  (112.27s elapsed)

  %GC     time      33.7%  (37.1% elapsed)

  Alloc rate    3,199,123,974 bytes per MUT second

  Productivity  66.3% of total user, 48.0% of total elapsed

./103_specialsubset2 +RTS -p -sstderr  81.26s user 30.90s system 99% cpu 1:52.29 total

Stats on time spent per cost-centre:

    Wed Dec 17 23:21 2014 Time and Allocation Profiling Report  (Final)

       103_specialsubset2 +RTS -p -sstderr -RTS

    total time  =       15.56 secs   (15565 ticks @ 1000 us, 1 processor)
    total alloc = 118,221,354,488 bytes  (excludes profiling overheads)

COST CENTRE     MODULE  %time %alloc

dynamic.value   Main     41.6   17.7
dynamic.table   Main     29.1   37.8
construct       Main     12.9   37.4
rule1           Main     12.4    7.0
dynamic.table.x Main      1.9    0.0


                                                                    individual     inherited
COST CENTRE               MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                      MAIN                     55           0    0.0    0.0   100.0  100.0
 main                     Main                    111           0    0.0    0.0     0.0    0.0
 CAF:main1                Main                    108           0    0.0    0.0     0.0    0.0
  main                    Main                    110           1    0.0    0.0     0.0    0.0
 CAF:main2                Main                    107           0    0.0    0.0     0.0    0.0
  main                    Main                    112           0    0.0    0.0     0.0    0.0
 CAF:main3                Main                    106           0    0.0    0.0     0.0    0.0
  main                    Main                    113           0    0.0    0.0     0.0    0.0
 CAF:construct            Main                    105           0    0.0    0.0   100.0  100.0
  construct               Main                    114           1    0.6    0.0   100.0  100.0
   construct              Main                    115           1   12.9   37.4    99.4  100.0
    rule1                 Main                    123      282235    0.6    0.0    86.5   62.6
     rule1                Main                    124      282235   12.4    7.0    85.9   62.6
      dynamic             Main                    125      282235    0.2    0.0    73.5   55.6
       dynamic.elements   Main                    133      282235    0.3    0.1     0.3    0.1
       dynamic.len        Main                    129      282235    0.0    0.0     0.0    0.0
       dynamic.table      Main                    128      282235   29.1   37.8    72.9   55.5
        dynamic.table.x   Main                    130   133204473    1.9    0.0    43.8   17.7
         dynamic.value    Main                    131   133204473   41.6   17.7    41.9   17.7
          dynamic.value.m Main                    132   132640003    0.3    0.0     0.3    0.0
       dynamic.maxSum     Main                    127      282235    0.0    0.0     0.0    0.0
       dynamic.(...)      Main                    126      282235    0.1    0.0     0.1    0.0
    dynamic               Main                    122      282235    0.0    0.0     0.0    0.0
    construct.nn          Main                    121    12683926    0.0    0.0     0.0    0.0
 CAF:main4                Main                    102           0    0.0    0.0     0.0    0.0
  construct               Main                    116           0    0.0    0.0     0.0    0.0
   construct              Main                    117           0    0.0    0.0     0.0    0.0
 CAF:cap                  Main                    101           0    0.0    0.0     0.0    0.0
  cap                     Main                    119           1    0.0    0.0     0.0    0.0
 CAF:flr                  Main                    100           0    0.0    0.0     0.0    0.0
  flr                     Main                    118           1    0.0    0.0     0.0    0.0
 CAF:step_r1dD            Main                     99           0    0.0    0.0     0.0    0.0
  step                    Main                    120           1    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Handle.FD         96           0    0.0    0.0     0.0    0.0
 CAF                      GHC.Conc.Signal          93           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding          91           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding.Iconv    82           0    0.0    0.0     0.0    0.0

Heap profile:

heap profile

Allocation by type:

enter image description here

Allocation by constructors:

TBD

like image 919
iceman Avatar asked Dec 18 '14 07:12

iceman


1 Answers

There is a lot that can be said. In this answer I'll just comment on the nested list comprehensions in the construct function.

To get an idea on what's going on in construct we'll isolate it and compare it to a nested loop version that you would write in an imperative language. We've removed the rule1 guard to test only the generation of lists.

-- List.hs -- using list comprehensions

import Control.Monad

cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int

construct :: [[Int]]
construct =  do
  a <- [flr..cap]                         
  b <- [a+step..cap]                      
  c <- [b+step..a+b-1]
  d <- [c+step..a+b-1]
  e <- [d+step..a+b-1]
  f <- [e+step..a+b-1]
  g <- [f+step..a+b-1]
  guard (a + b + c + d - e - f - g > 0)
  guard (a + b + c + d + e + f + g < 285)
  return  [g,f,e,d,c,b,a]
  -- guard (rule1 nn)

main = do
  forM_ construct print


-- Loops.hs -- using imperative looping

import Control.Monad

loop a b f = go a
  where go i | i > b     = return ()
             | otherwise = do f i; go (i+1)

cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int

main =
  loop flr cap $ \a ->
  loop (a+step) cap $ \b ->
  loop (b+step) (a+b-1) $ \c ->
  loop (c+step) (a+b-1) $ \d ->
  loop (d+step) (a+b-1) $ \e ->
  loop (e+step) (a+b-1) $ \f ->
  loop (f+step) (a+b-1) $ \g ->
    if (a+b+c+d-e-f-g > 0) && (a+b+c+d+e+f+g < 285)
      then print [g,f,e,d,c,b,a]
      else return ()

Both programs were compiled with ghc -O2 -rtsopts and run with prog +RTS -s > out.

Here is a summary of the results:

                          Lists.hs    Loops.hs
  Heap allocation        44,913 MB    2,740 MB
  Max. Residency            44,312      44,312
  %GC                        5.8 %       1.7 %
  Total Time             9.48 secs   1.43 secs

As you can see, the loop version, which is the way you would write this in a language like C, wins in every category.

The list comprehension version is cleaner and more composable but also less performant than direct iteration.

like image 104
ErikR Avatar answered Nov 16 '22 03:11

ErikR