How do I optimize a loop which can be fully strict?

Tags: performance, haskell, micro-optimization

I'm trying to write a brute-force solution to Project Euler Problem #145, and I cannot get my solution to run in less than about 1 minute 30 secs.

(I'm aware there are various short-cuts and even paper-and-pencil solutions; for the purpose of this question I'm not considering those).

In the best version I've come up with so far, profiling shows that the majority of the time is spent in foldDigits. This function need not be lazy at all, and to my mind it ought to be optimized to a simple loop. As you can see, I've attempted to make various bits of the program strict.

So my question is: without changing the overall algorithm, is there some way to bring the execution time of this program down to the sub-minute mark?

(Or if not, is there a way to see that the code of foldDigits is as optimized as possible?)

-- ghc -O3 -threaded Euler-145.hs && Euler-145.exe +RTS -N4

{-# LANGUAGE BangPatterns #-}

import Control.Parallel.Strategies

foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
    | n < 10    = i
    | otherwise = foldDigits f i d
  where (d, m) = n `quotRem` 10
        !i     = f acc m

reverseNumber :: Int -> Int
reverseNumber !n
    = foldDigits accumulate 0 n
  where accumulate !v !d = v * 10 + d

allDigitsOdd :: Int -> Bool
allDigitsOdd n
    = foldDigits andOdd True n
  where andOdd !a d = a && isOdd d
        isOdd !x    = x `rem` 2 /= 0

isReversible :: Int -> Bool
isReversible n
    = notDivisibleByTen n && allDigitsOdd (n + rn)
  where rn                   = reverseNumber n
        notDivisibleByTen !x = x `rem` 10 /= 0

countRange acc start end
    | start > end = acc
    | otherwise   = countRange (acc + v) (start + 1) end
  where v = if isReversible start then 1 else 0

main
    = print $ sum $ parMap rseq cr ranges
  where max       = 1000000000
        qmax      = max `div` 4
        ranges    = [(1, qmax), (qmax, qmax * 2), (qmax * 2, qmax * 3), (qmax * 3, max)]
        cr (s, e) = countRange 0 s e


asked Nov 06 '12 14:11

stusmith

1 Answer

As it stands, the core that ghc-7.6.1 produces for foldDigits (with -O2) is

Rec {
$wfoldDigits_r2cK
  :: forall a_aha.
     (a_aha -> GHC.Types.Int -> a_aha)
     -> a_aha -> GHC.Prim.Int# -> a_aha
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType C(C(S))SL]
$wfoldDigits_r2cK =
  \ (@ a_aha)
    (w_s284 :: a_aha -> GHC.Types.Int -> a_aha)
    (w1_s285 :: a_aha)
    (ww_s288 :: GHC.Prim.Int#) ->
    case w1_s285 of acc_Xhi { __DEFAULT ->
    let {
      ds_sNo [Dmd=Just D(D(T)S)] :: (GHC.Types.Int, GHC.Types.Int)
      [LclId, Str=DmdType]
      ds_sNo =
        case GHC.Prim.quotRemInt# ww_s288 10
        of _ { (# ipv_aJA, ipv1_aJB #) ->
        (GHC.Types.I# ipv_aJA, GHC.Types.I# ipv1_aJB)
        } } in
    case w_s284 acc_Xhi (case ds_sNo of _ { (d_arS, m_Xsi) -> m_Xsi })
    of i_ahg { __DEFAULT ->
    case GHC.Prim.<# ww_s288 10 of _ {
      GHC.Types.False ->
        case ds_sNo of _ { (d_Xsi, m_Xs5) ->
        case d_Xsi of _ { GHC.Types.I# ww1_X28L ->
        $wfoldDigits_r2cK @ a_aha w_s284 i_ahg ww1_X28L
        }
        };
      GHC.Types.True -> i_ahg
    }
    }
    }
end Rec }

which, as you can see, re-boxes the results of the quotRem call into a tuple of Ints. The problem is that nothing is known about f here (in particular, not its strictness), and since foldDigits is itself recursive, it cannot be inlined at its call sites.
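For anyone wanting to inspect the Core themselves, here is a minimal sketch. It assumes the flags are set with an OPTIONS_GHC pragma for convenience; passing -O2 -ddump-simpl on the ghc command line should have the same effect. The lambda passed to foldDigits and the small main are only there to make the file compile on its own and are not part of the original program.

```haskell
{-# LANGUAGE BangPatterns #-}
-- Assumption: flags are placed in the file for convenience. Compiling this
-- module makes GHC print the optimised Core to standard output.
{-# OPTIONS_GHC -O2 -ddump-simpl #-}

module Main (main) where

-- foldDigits exactly as written in the question: the function argument `f`
-- is passed along on every recursive call.
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
    | n < 10    = i
    | otherwise = foldDigits f i d
  where (d, m) = n `quotRem` 10
        !i     = f acc m

-- One use site, so there is something for the simplifier to work on.
reverseNumber :: Int -> Int
reverseNumber n = foldDigits (\v d -> v * 10 + d) 0 n

main :: IO ()
main = print (reverseNumber 1234)  -- prints 4321
```

The dump is verbose; adding -dsuppress-all to the flags hides most of the type and unique-name annotations, at the cost of also hiding the strictness information shown in the listing above.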

With a manual worker-wrapper transform making the function argument static,

foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
  where
    go !acc 0 = acc
    go acc n = case n `quotRem` 10 of
                 (q,r) -> go (f acc r) q

foldDigits becomes inlinable, and you get specialised versions for your uses operating on unboxed data, but no top-level foldDigits, e.g.

Rec {
$wgo_r2di :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
$wgo_r2di =
  \ (ww_s28F :: GHC.Prim.Int#) (ww1_s28J :: GHC.Prim.Int#) ->
    case ww1_s28J of ds_XJh {
      __DEFAULT ->
        case GHC.Prim.quotRemInt# ds_XJh 10
        of _ { (# ipv_aJK, ipv1_aJL #) ->
        $wgo_r2di (GHC.Prim.+# (GHC.Prim.*# ww_s28F 10) ipv1_aJL) ipv_aJK
        };
      0 -> ww_s28F
    }
end Rec }

and the effect on computation time is tangible. For the original, I got

$ ./eul145 +RTS -s -N2
608720
1,814,289,579,592 bytes allocated in the heap
     196,407,088 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          30,640 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     1827331 colls, 1827331 par   23.77s   11.86s     0.0000s    0.0041s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0001s    0.0001s

  Parallel GC work balance: 54.94% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  620.52s  (313.51s elapsed)
  GC      time   23.77s  ( 11.86s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  644.29s  (325.37s elapsed)

  Alloc rate    2,923,834,808 bytes per MUT second

(I used -N2 since my i5 only has two physical cores), vs.

$ ./eul145 +RTS -s -N2
608720
  16,000,063,624 bytes allocated in the heap
         403,384 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          30,640 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     15852 colls, 15852 par    0.34s    0.17s     0.0000s    0.0037s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0001s    0.0001s

  Parallel GC work balance: 43.86% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  314.85s  (160.08s elapsed)
  GC      time    0.34s  (  0.17s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  315.20s  (160.25s elapsed)

  Alloc rate    50,817,657 bytes per MUT second

  Productivity  99.9% of total user, 196.5% of total elapsed

with the modification. The running time roughly halved (313.51s to 160.08s elapsed in the mutator), and allocation dropped roughly 100-fold (from about 1.8 TB to 16 GB).
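For reference, here is a sketch of how the static-argument foldDigits slots into the rest of the program from the question. To keep it quick to run, it makes two assumptions that are not in the original: the parallel split is dropped, and the range is cut down to the numbers below one thousand, for which the Project Euler problem statement gives the count 120. The type signature and the bang on the accumulator of countRange are also additions.

```haskell
{-# LANGUAGE BangPatterns #-}

module Main (main) where

-- Static-argument version from the answer: `f` is bound once by the wrapper
-- and only the local worker `go` is recursive, so GHC can inline foldDigits
-- at each use and specialise `go` for that particular `f`.
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
  where
    go !acc 0 = acc
    go acc n = case n `quotRem` 10 of
                 (q, r) -> go (f acc r) q

reverseNumber :: Int -> Int
reverseNumber n = foldDigits accumulate 0 n
  where accumulate v d = v * 10 + d

allDigitsOdd :: Int -> Bool
allDigitsOdd n = foldDigits andOdd True n
  where andOdd a d = a && d `rem` 2 /= 0

isReversible :: Int -> Bool
isReversible n = n `rem` 10 /= 0 && allDigitsOdd (n + reverseNumber n)

-- Counts the reversible numbers in the inclusive range [start, end].
countRange :: Int -> Int -> Int -> Int
countRange !acc start end
    | start > end = acc
    | otherwise   = countRange (acc + v) (start + 1) end
  where v = if isReversible start then 1 else 0

main :: IO ()
main = print (countRange 0 1 999)  -- prints 120
```

Note that this go returns the accumulator unchanged when given 0, whereas the foldDigits in the question applies f once to the digit 0 in that case. For the positive inputs used here the two behave the same.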


answered Sep 27 '22 18:09

Daniel Fischer
