This piece of Haskell code runs much slower with <code>-O</code>, but <code>-O</code> should be non-dangerous. Can anyone tell me what happened? If it matters, it is an attempt to solve this problem, and it uses binary search and persistent segment tree: <pre class="prettyprint"><code>import Control.Monad import Data.Array data Node = Leaf Int -- value | Branch Int Node Node -- sum, left child, right child type NodeArray = Array Int Node -- create an empty node with range [l, r) create :: Int -> Int -> Node create l r | l + 1 == r = Leaf 0 | otherwise = Branch 0 (create l m) (create m r) where m = (l + r) `div` 2 -- Get the sum in range [0, r). The range of the node is [nl, nr) sumof :: Node -> Int -> Int -> Int -> Int sumof (Leaf val) r nl nr | nr <= r = val | otherwise = 0 sumof (Branch sum lc rc) r nl nr | nr <= r = sum | r > nl = (sumof lc r nl m) + (sumof rc r m nr) | otherwise = 0 where m = (nl + nr) `div` 2 -- Increase the value at x by 1. The range of the node is [nl, nr) increase :: Node -> Int -> Int -> Int -> Node increase (Leaf val) x nl nr = Leaf (val + 1) increase (Branch sum lc rc) x nl nr | x < m = Branch (sum + 1) (increase lc x nl m) rc | otherwise = Branch (sum + 1) lc (increase rc x m nr) where m = (nl + nr) `div` 2 -- signature said it all tonodes :: Int -> [Int] -> [Node] tonodes n = reverse . tonodes' . reverse where tonodes' :: [Int] -> [Node] tonodes' (h:t) = increase h' h 0 n : s' where s'@(h':_) = tonodes' t tonodes' _ = [create 0 n] -- find the minimum m in [l, r] such that (predicate m) is True binarysearch :: (Int -> Bool) -> Int -> Int -> Int binarysearch predicate l r | l == r = r | predicate m = binarysearch predicate l m | otherwise = binarysearch predicate (m+1) r where m = (l + r) `div` 2 -- main, literally main :: IO () main = do [n, m] <- fmap (map read . words) getLine nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine replicateM_ m $ query n nodes where query :: Int -> NodeArray -> IO () query n nodes = do [p, k] <- fmap (map read . words) getLine print $ binarysearch (ok nodes n p k) 0 n where ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool ok nodes n p k s = (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k </code></pre> (This is exactly the same code with code review but this question addresses another problem.) This is my input generator in C++: <pre class="prettyprint lang-cpp prettyprint-override"><code>#include <cstdio> #include <cstdlib> using namespace std; int main (int argc, char * argv[]) { srand(1827); int n = 100000; if(argc > 1) sscanf(argv[1], "%d", &n); printf("%d %d\n", n, n); for(int i = 0; i < n; i++) printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' '); for(int i = 0; i < n; i++) { int p = rand() % n; int k = rand() % n + 1; printf("%d %d\n", p, k); } } </code></pre> In case you don't have a C++ compiler available, this is the result of <code>./gen.exe 1000</code>. This is the execution result on my computer: <pre class="prettyprint lang-none prettyprint-override"><code>$ ghc --version The Glorious Glasgow Haskell Compilation System, version 7.8.3 $ ghc -fforce-recomp 1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ time ./gen.exe 1000 | ./1827.exe > /dev/null real 0m0.088s user 0m0.015s sys 0m0.015s $ ghc -fforce-recomp -O 1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ time ./gen.exe 1000 | ./1827.exe > /dev/null real 0m2.969s user 0m0.000s sys 0m0.045s </code></pre> And this is the heap profile summary: <pre class="prettyprint lang-none prettyprint-override"><code>$ ghc -fforce-recomp -rtsopts ./1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null 70,207,096 bytes allocated in the heap 2,112,416 bytes copied during GC 613,368 bytes maximum residency (3 sample(s)) 28,816 bytes maximum slop 3 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 132 colls, 0 par 0.00s 0.00s 0.0000s 0.0004s Gen 1 3 colls, 0 par 0.00s 0.00s 0.0006s 0.0010s INIT time 0.00s ( 0.00s elapsed) MUT time 0.03s ( 0.03s elapsed) GC time 0.00s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.03s ( 0.04s elapsed) %GC time 0.0% (14.7% elapsed) Alloc rate 2,250,213,011 bytes per MUT second Productivity 100.0% of total user, 83.1% of total elapsed $ ghc -fforce-recomp -O -rtsopts ./1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null 6,009,233,608 bytes allocated in the heap 622,682,200 bytes copied during GC 443,240 bytes maximum residency (505 sample(s)) 48,256 bytes maximum slop 3 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 10945 colls, 0 par 0.72s 0.63s 0.0001s 0.0004s Gen 1 505 colls, 0 par 0.16s 0.13s 0.0003s 0.0005s INIT time 0.00s ( 0.00s elapsed) MUT time 2.00s ( 2.13s elapsed) GC time 0.87s ( 0.76s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.89s ( 2.90s elapsed) %GC time 30.3% (26.4% elapsed) Alloc rate 3,009,412,603 bytes per MUT second Productivity 69.7% of total user, 69.4% of total elapsed </code></pre>

<h3>What happened to your code with <code>-O</code> </h3> Let me zoom in your main function, and rewrite it slightly: <pre class="prettyprint"><code>main :: IO () main = do [n, m] <- fmap (map read . words) getLine line <- getLine let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line replicateM_ m $ query n nodes </code></pre> Clearly, the intention here is that the <code>NodeArray</code> is created once, and then used in every of the <code>m</code> invocations of <code>query</code>. Unfortunately, GHC transforms this code to, effectively, <pre class="prettyprint"><code>main = do [n, m] <- fmap (map read . words) getLine line <- getLine replicateM_ m $ do let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line query n nodes </code></pre> and you can immediately see the problem here. <h3>What is the state hack, and why does it destroy my programs performance</h3> The reason is the state hack, which says (roughly): “When something is of type <code>IO a</code>, assume it is called only once.”. The official documentation is not much more elaborate: <blockquote> <code>-fno-state-hack</code> Turn off the "state hack" whereby any lambda with a State# token as argument is considered to be single-entry, hence it is considered OK to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing. </blockquote> Roughly, the idea is as follows: If you define a function with an <code>IO</code> type and a where clause, e.g. <pre class="prettyprint"><code>foo x = do putStrLn y putStrLn y where y = ...x... </code></pre> Something of type <code>IO a</code> can be viewed as something of type <code>RealWord -> (a, RealWorld)</code>. In that view, the above becomes (roughly) <pre class="prettyprint"><code>foo x = let y = ...x... in \world1 -> let (world2, ()) = putStrLn y world1 let (world3, ()) = putStrLn y world2 in (world3, ()) </code></pre> A call to <code>foo</code> would (typically) look like this <code>foo argument world</code>. But the definition of <code>foo</code> only takes one argument, and the other one is only consumed later by a local lambda expression! That is going to be a very slow call to <code>foo</code>. It would be much faster if the code would look like this: <pre class="prettyprint"><code>foo x world1 = let y = ...x... in let (world2, ()) = putStrLn y world1 let (world3, ()) = putStrLn y world2 in (world3, ()) </code></pre> This is called eta-expansion and done on various grounds (e.g. by analyzing the function’s definition, by checking how it is being called, and – in this case – type directed heuristics). Unfortunately, this degrades performance if the call to <code>foo</code> is actually of the form <code>let fooArgument = foo argument</code>, i.e. with an argument, but no <code>world</code> passed (yet). In the original code, if <code>fooArgument</code> is then used several times, <code>y</code> will still be calculated only once, and shared. In the modified code, <code>y</code> will be re-calculated every time – precisely what has happened to your <code>nodes</code>. <h3>Can things be fixed?</h3> Possibly. See #9388 for an attempt at doing so. The problem with fixing it is that it will cost performance in a lot of cases where the transformation happens to ok, even though the compiler cannot possibly know that for sure. And there are probably cases where it is technically not ok, i.e. sharing is lost, but it is still beneficial because the speedups from the faster calling outweigh the extra cost of the recalculation. So it is not clear where to go from here.

Why does this Haskell code run slower with -O?

Tags:

optimization

haskell

ghc

compiler-bug

This piece of Haskell code runs much slower with -O, but -O should be non-dangerous. Can anyone tell me what happened? If it matters, it is an attempt to solve this problem, and it uses binary search and persistent segment tree:

import Control.Monad import Data.Array  data Node =       Leaf   Int           -- value     | Branch Int Node Node -- sum, left child, right child type NodeArray = Array Int Node  -- create an empty node with range [l, r) create :: Int -> Int -> Node create l r     | l + 1 == r = Leaf 0     | otherwise  = Branch 0 (create l m) (create m r)     where m = (l + r) `div` 2  -- Get the sum in range [0, r). The range of the node is [nl, nr) sumof :: Node -> Int -> Int -> Int -> Int sumof (Leaf val) r nl nr     | nr <= r   = val     | otherwise = 0 sumof (Branch sum lc rc) r nl nr     | nr <= r   = sum     | r  > nl   = (sumof lc r nl m) + (sumof rc r m nr)     | otherwise = 0     where m = (nl + nr) `div` 2  -- Increase the value at x by 1. The range of the node is [nl, nr) increase :: Node -> Int -> Int -> Int -> Node increase (Leaf val) x nl nr = Leaf (val + 1) increase (Branch sum lc rc) x nl nr     | x < m     = Branch (sum + 1) (increase lc x nl m) rc     | otherwise = Branch (sum + 1) lc (increase rc x m nr)     where m = (nl + nr) `div` 2  -- signature said it all tonodes :: Int -> [Int] -> [Node] tonodes n = reverse . tonodes' . reverse     where         tonodes' :: [Int] -> [Node]         tonodes' (h:t) = increase h' h 0 n : s' where s'@(h':_) = tonodes' t         tonodes' _ = [create 0 n]  -- find the minimum m in [l, r] such that (predicate m) is True binarysearch :: (Int -> Bool) -> Int -> Int -> Int binarysearch predicate l r     | l == r      = r     | predicate m = binarysearch predicate l m     | otherwise   = binarysearch predicate (m+1) r     where m = (l + r) `div` 2  -- main, literally main :: IO () main = do     [n, m] <- fmap (map read . words) getLine     nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine     replicateM_ m $ query n nodes     where         query :: Int -> NodeArray -> IO ()         query n nodes = do             [p, k] <- fmap (map read . words) getLine             print $ binarysearch (ok nodes n p k) 0 n             where                 ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool                 ok nodes n p k s = (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k

(This is exactly the same code with code review but this question addresses another problem.)

This is my input generator in C++:

#include <cstdio> #include <cstdlib> using namespace std; int main (int argc, char * argv[]) {     srand(1827);     int n = 100000;     if(argc > 1)         sscanf(argv[1], "%d", &n);     printf("%d %d\n", n, n);     for(int i = 0; i < n; i++)         printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' ');     for(int i = 0; i < n; i++) {         int p = rand() % n;         int k = rand() % n + 1;         printf("%d %d\n", p, k);     } }

In case you don't have a C++ compiler available, this is the result of ./gen.exe 1000.

This is the execution result on my computer:

$ ghc --version The Glorious Glasgow Haskell Compilation System, version 7.8.3 $ ghc -fforce-recomp 1827.hs [1 of 1] Compiling Main             ( 1827.hs, 1827.o ) Linking 1827.exe ... $ time ./gen.exe 1000 | ./1827.exe > /dev/null real    0m0.088s user    0m0.015s sys     0m0.015s $ ghc -fforce-recomp -O 1827.hs [1 of 1] Compiling Main             ( 1827.hs, 1827.o ) Linking 1827.exe ... $ time ./gen.exe 1000 | ./1827.exe > /dev/null real    0m2.969s user    0m0.000s sys     0m0.045s

And this is the heap profile summary:

$ ghc -fforce-recomp -rtsopts ./1827.hs [1 of 1] Compiling Main             ( 1827.hs, 1827.o ) Linking 1827.exe ... $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null       70,207,096 bytes allocated in the heap        2,112,416 bytes copied during GC          613,368 bytes maximum residency (3 sample(s))           28,816 bytes maximum slop                3 MB total memory in use (0 MB lost due to fragmentation)                                     Tot time (elapsed)  Avg pause  Max pause   Gen  0       132 colls,     0 par    0.00s    0.00s     0.0000s    0.0004s   Gen  1         3 colls,     0 par    0.00s    0.00s     0.0006s    0.0010s   INIT    time    0.00s  (  0.00s elapsed)   MUT     time    0.03s  (  0.03s elapsed)   GC      time    0.00s  (  0.01s elapsed)   EXIT    time    0.00s  (  0.00s elapsed)   Total   time    0.03s  (  0.04s elapsed)   %GC     time       0.0%  (14.7% elapsed)   Alloc rate    2,250,213,011 bytes per MUT second   Productivity 100.0% of total user, 83.1% of total elapsed $ ghc -fforce-recomp -O -rtsopts ./1827.hs [1 of 1] Compiling Main             ( 1827.hs, 1827.o ) Linking 1827.exe ... $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null    6,009,233,608 bytes allocated in the heap      622,682,200 bytes copied during GC          443,240 bytes maximum residency (505 sample(s))           48,256 bytes maximum slop                3 MB total memory in use (0 MB lost due to fragmentation)                                     Tot time (elapsed)  Avg pause  Max pause   Gen  0     10945 colls,     0 par    0.72s    0.63s     0.0001s    0.0004s   Gen  1       505 colls,     0 par    0.16s    0.13s     0.0003s    0.0005s   INIT    time    0.00s  (  0.00s elapsed)   MUT     time    2.00s  (  2.13s elapsed)   GC      time    0.87s  (  0.76s elapsed)   EXIT    time    0.00s  (  0.00s elapsed)   Total   time    2.89s  (  2.90s elapsed)   %GC     time      30.3%  (26.4% elapsed)   Alloc rate    3,009,412,603 bytes per MUT second   Productivity  69.7% of total user, 69.4% of total elapsed

936

asked Apr 02 '15 02:04

johnchen902

1 Answers

What happened to your code with `-O`

Let me zoom in your main function, and rewrite it slightly:

main :: IO () main = do     [n, m] <- fmap (map read . words) getLine     line <- getLine     let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line     replicateM_ m $ query n nodes

Clearly, the intention here is that the NodeArray is created once, and then used in every of the m invocations of query.

Unfortunately, GHC transforms this code to, effectively,

main = do     [n, m] <- fmap (map read . words) getLine     line <- getLine     replicateM_ m $ do         let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line         query n nodes

and you can immediately see the problem here.

What is the state hack, and why does it destroy my programs performance

The reason is the state hack, which says (roughly): “When something is of type IO a, assume it is called only once.”. The official documentation is not much more elaborate:

-fno-state-hack

Turn off the "state hack" whereby any lambda with a State# token as argument is considered to be single-entry, hence it is considered OK to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing.

Roughly, the idea is as follows: If you define a function with an IO type and a where clause, e.g.

foo x = do     putStrLn y     putStrLn y   where y = ...x...

Something of type IO a can be viewed as something of type RealWord -> (a, RealWorld). In that view, the above becomes (roughly)

foo x =     let y = ...x... in     \world1 ->      let (world2, ()) = putStrLn y world1      let (world3, ()) = putStrLn y world2      in  (world3, ())

A call to foo would (typically) look like this foo argument world. But the definition of foo only takes one argument, and the other one is only consumed later by a local lambda expression! That is going to be a very slow call to foo. It would be much faster if the code would look like this:

foo x world1 =     let y = ...x... in     let (world2, ()) = putStrLn y world1    let (world3, ()) = putStrLn y world2    in  (world3, ())

This is called eta-expansion and done on various grounds (e.g. by analyzing the function’s definition, by checking how it is being called, and – in this case – type directed heuristics).

Unfortunately, this degrades performance if the call to foo is actually of the form let fooArgument = foo argument, i.e. with an argument, but no world passed (yet). In the original code, if fooArgument is then used several times, y will still be calculated only once, and shared. In the modified code, y will be re-calculated every time – precisely what has happened to your nodes.

Can things be fixed?

Possibly. See #9388 for an attempt at doing so. The problem with fixing it is that it will cost performance in a lot of cases where the transformation happens to ok, even though the compiler cannot possibly know that for sure. And there are probably cases where it is technically not ok, i.e. sharing is lost, but it is still beneficial because the speedups from the faster calling outweigh the extra cost of the recalculation. So it is not clear where to go from here.

answered Oct 22 '22 23:10

Joachim Breitner

Related questions
                            
                                How to provide explicit type declarations for functions when using GHCi?
                            
                                Haskell types frustrating a simple 'average' function
                            
                                What is Levity polymorphism
                            
                                What do "reify" and "reification" mean in the context of (functional?) programming?
                            
                                Haskell on JVM?
                            
                                Haskell: lift vs liftIO
                            
                                How does foldr work?
                            
                                How to read this GHC Core "proof"?
                            
                                Data.Text vs String
                            
                                Concrete example showing that monads are not closed under composition (with proof)?
                            
                                How can I uninstall a version of a Cabal package?
                            
                                What language to learn after Haskell? [closed]
                            
                                How can I get esqueleto to generate an SQL string for me?
                            
                                Why does this Haskell code produce the "infinite type" error?
                            
                                Recursion schemes for dummies?
                            
                                Why is Haskell (sometimes) referred to as "Best Imperative Language"?
                            
                                Haskell "do nothing" IO, or if without else
                            
                                What is the combinatory logic equivalent of intuitionistic type theory?
                            
                                How do exceptions in Haskell work?
                            
                                Orphaned instances in Haskell

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why does this Haskell code run slower with -O?

Tags:

optimization

haskell

ghc

compiler-bug

johnchen902

People also ask

1 Answers

What happened to your code with `-O`

What is the state hack, and why does it destroy my programs performance

Can things be fixed?

Joachim Breitner

Recent Activity

Donate For Us

Why does this Haskell code run slower with -O?

Tags:

optimization

haskell

ghc

compiler-bug

johnchen902

People also ask

1 Answers

What happened to your code with -O

What is the state hack, and why does it destroy my programs performance

Can things be fixed?

Joachim Breitner

Related questions

Recent Activity

Donate For Us

What happened to your code with `-O`