Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I optimize a loop which can be fully strict

I'm trying to write a brute-force solution to Project Euler Problem #145, and I cannot get my solution to run in less than about 1 minute 30 secs.

(I'm aware there are various short-cuts and even paper-and-pencil solutions; for the purpose of this question I'm not considering those).

In the best version I've come up with so far, profiling shows that the majority of the time is spent in foldDigits. This function need not be lazy at all, and to my mind ought to be optimized to a simple loop. As you can see I've attempted to make various bits of the program strict.

So my question is: without changing the overall algorithm, is there some way to bring the execution time of this program down to the sub-minute mark?

(Or if not, is there a way to see that the code of foldDigits is as optimized as possible?)

-- ghc -O3 -threaded Euler-145.hs && Euler-145.exe +RTS -N4

{-# LANGUAGE BangPatterns #-}

import Control.Parallel.Strategies

foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
    | n < 10    = i
    | otherwise = foldDigits f i d
  where (d, m) = n `quotRem` 10
        !i     = f acc m

reverseNumber :: Int -> Int
reverseNumber !n
    = foldDigits accumulate 0 n
  where accumulate !v !d = v * 10 + d

allDigitsOdd :: Int -> Bool
allDigitsOdd n
    = foldDigits andOdd True n
  where andOdd !a d = a && isOdd d
        isOdd !x    = x `rem` 2 /= 0

isReversible :: Int -> Bool
isReversible n
    = notDivisibleByTen n && allDigitsOdd (n + rn)
  where rn                   = reverseNumber n
        notDivisibleByTen !x = x `rem` 10 /= 0

countRange acc start end
    | start > end = acc
    | otherwise   = countRange (acc + v) (start + 1) end
  where v = if isReversible start then 1 else 0

main
    = print $ sum $ parMap rseq cr ranges
  where max       = 1000000000
        qmax      = max `div` 4
        ranges    = [(1, qmax), (qmax, qmax * 2), (qmax * 2, qmax * 3), (qmax * 3, max)]
        cr (s, e) = countRange 0 s e
like image 957
stusmith Avatar asked Nov 06 '12 14:11

stusmith


People also ask

Which is the best way to increase the execution speed of the loop?

Preallocation: Since loops that incrementally increase the size of a data structure (eg. 'y' in the provided example) each time through the loop can adversely affect performance and memory use, you can improve code execution time by preallocating the maximum amount of space required for the array.


1 Answers

As it stands, the core that ghc-7.6.1 produces for foldDigits (with -O2) is

Rec {
$wfoldDigits_r2cK
  :: forall a_aha.
     (a_aha -> GHC.Types.Int -> a_aha)
     -> a_aha -> GHC.Prim.Int# -> a_aha
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType C(C(S))SL]
$wfoldDigits_r2cK =
  \ (@ a_aha)
    (w_s284 :: a_aha -> GHC.Types.Int -> a_aha)
    (w1_s285 :: a_aha)
    (ww_s288 :: GHC.Prim.Int#) ->
    case w1_s285 of acc_Xhi { __DEFAULT ->
    let {
      ds_sNo [Dmd=Just D(D(T)S)] :: (GHC.Types.Int, GHC.Types.Int)
      [LclId, Str=DmdType]
      ds_sNo =
        case GHC.Prim.quotRemInt# ww_s288 10
        of _ { (# ipv_aJA, ipv1_aJB #) ->
        (GHC.Types.I# ipv_aJA, GHC.Types.I# ipv1_aJB)
        } } in
    case w_s284 acc_Xhi (case ds_sNo of _ { (d_arS, m_Xsi) -> m_Xsi })
    of i_ahg { __DEFAULT ->
    case GHC.Prim.<# ww_s288 10 of _ {
      GHC.Types.False ->
        case ds_sNo of _ { (d_Xsi, m_Xs5) ->
        case d_Xsi of _ { GHC.Types.I# ww1_X28L ->
        $wfoldDigits_r2cK @ a_aha w_s284 i_ahg ww1_X28L
        }
        };
      GHC.Types.True -> i_ahg
    }
    }
    }
end Rec }

which, as you can see, re-boxes the result of the quotRem call. The problem is that no property of f is available here, and as a recursive function, foldDigits cannot be inlined.

With a manual worker-wrapper transform making the function argument static,

foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
  where
    go !acc 0 = acc
    go acc n = case n `quotRem` 10 of
                 (q,r) -> go (f acc r) q

foldDigits becomes inlinable, and you get specialised versions for your uses operating on unboxed data, but no top-level foldDigits, e.g.

Rec {
$wgo_r2di :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
$wgo_r2di =
  \ (ww_s28F :: GHC.Prim.Int#) (ww1_s28J :: GHC.Prim.Int#) ->
    case ww1_s28J of ds_XJh {
      __DEFAULT ->
        case GHC.Prim.quotRemInt# ds_XJh 10
        of _ { (# ipv_aJK, ipv1_aJL #) ->
        $wgo_r2di (GHC.Prim.+# (GHC.Prim.*# ww_s28F 10) ipv1_aJL) ipv_aJK
        };
      0 -> ww_s28F
    }
end Rec }

and the effect on computation time is tangible, for the original, I got

$ ./eul145 +RTS -s -N2
608720
1,814,289,579,592 bytes allocated in the heap
     196,407,088 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          30,640 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     1827331 colls, 1827331 par   23.77s   11.86s     0.0000s    0.0041s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0001s    0.0001s

  Parallel GC work balance: 54.94% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  620.52s  (313.51s elapsed)
  GC      time   23.77s  ( 11.86s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  644.29s  (325.37s elapsed)

  Alloc rate    2,923,834,808 bytes per MUT second

(I used -N2 since my i5 only has two physical cores), vs.

$ ./eul145 +RTS -s -N2
608720
  16,000,063,624 bytes allocated in the heap
         403,384 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          30,640 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     15852 colls, 15852 par    0.34s    0.17s     0.0000s    0.0037s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0001s    0.0001s

  Parallel GC work balance: 43.86% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  314.85s  (160.08s elapsed)
  GC      time    0.34s  (  0.17s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  315.20s  (160.25s elapsed)

  Alloc rate    50,817,657 bytes per MUT second

  Productivity  99.9% of total user, 196.5% of total elapsed

with the modification. The running time roughly halved, and the allocations reduced 100-fold.

like image 150
Daniel Fischer Avatar answered Sep 27 '22 18:09

Daniel Fischer