Here's my implementation of a sort of treap (with implicit keys and some additional information stored in nodes): http://hpaste.org/42839/treap_with_implicit_keys
According to the profiling data, GC takes 80% of this program's running time. As far as I understand, this is because every time a node is 'modified', every node on the path back to the root is recreated.
Is there anything I can do to improve performance here, or do I have to descend into the realm of the ST monad?
For background: a treap is a hybrid of a binary search tree and a heap. Like a randomized binary search tree, it maintains a dynamic set of ordered keys and supports binary search, staying balanced in expectation because every node carries a random heap priority. With implicit keys, a node's 'key' is simply its position in the sequence, so the treap represents a sequence rather than a set; a plain Cartesian tree built over a sorted sequence would degenerate into a linked list, which is exactly what the random priorities prevent. Treaps are also simple to implement because a node's priority never has to be adjusted after insertion.
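Since the paste link may not stay up, here is a minimal, generic sketch of the kind of structure being described (not the poster's actual code; the cached annotation is shown as a subtree size, and merge/split are the usual implicit-key operations). It also shows where the allocation pressure comes from: every merge or split rebuilds the nodes along its path.

module TreapSketch where

-- A node stores a random heap priority, a cached subtree size (which is what
-- makes the keys "implicit": a node's position is recovered from the sizes),
-- and the payload.  Any extra per-node annotation would be cached the same way.
data Tree a
  = Leaf
  | Node !Int       -- random heap priority
         !Int       -- cached subtree size
         (Tree a) a (Tree a)

size :: Tree a -> Int
size Leaf             = 0
size (Node _ s _ _ _) = s

-- Smart constructor that refreshes the cached size.
node :: Int -> Tree a -> a -> Tree a -> Tree a
node p l x r = Node p (size l + 1 + size r) l x r

-- Merge two treaps, with every element of the left preceding the right.
-- Note how each step rebuilds a node on the spine: this is the path copying
-- the question refers to, and the source of most of the allocation.
merge :: Tree a -> Tree a -> Tree a
merge Leaf r = r
merge l Leaf = l
merge l@(Node pl _ ll xl rl) r@(Node pr _ lr xr rr)
  | pl > pr   = node pl ll xl (merge rl r)
  | otherwise = node pr (merge l lr) xr rr

-- Split off the first k elements.
split :: Int -> Tree a -> (Tree a, Tree a)
split _ Leaf = (Leaf, Leaf)
split k (Node p _ l x r)
  | k <= size l = let (a, b) = split k l
                  in  (a, node p b x r)
  | otherwise   = let (a, b) = split (k - size l - 1) r
                  in  (node p l x a, b)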
Using GHC 7.0.3, I can reproduce your heavy GC behavior:
$ time ./A +RTS -s
%GC time 92.9% (92.9% elapsed)
./A +RTS -s 7.24s user 0.04s system 99% cpu 7.301 total
I spent 10 minutes going through the program. Here's what I did, in order: lean on GHC's magic -H flag, drop the useless UNPACK pragmas, inline update, inline height, and enlarge the allocation area with -A. The result is roughly a 10-fold speedup, with GC down to around 45% of the run time.
First, using GHC's magic -H flag, we can reduce that runtime quite a bit:
$ time ./A +RTS -s -H
%GC time 74.3% (75.3% elapsed)
./A +RTS -s -H 2.34s user 0.04s system 99% cpu 2.392 total
Not bad!
The UNPACK pragmas on the Tree nodes won't do anything, so remove those.
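For readers who haven't met the pragma, here is a generic illustration (not the original code): UNPACK only takes effect on strict fields of single-constructor types, and only when compiling with -O; on fields it cannot unpack, GHC simply ignores it.

module UnpackDemo where

-- Illustration only.  UNPACK unboxes a strict field whose type is a
-- single-constructor type (and only under -O); on anything else, such as
-- the recursive Tree fields below, GHC ignores the pragma.
data Tree a
  = Leaf
  | Node {-# UNPACK #-} !Int   -- effective: Int is strict and single-constructor
         !(Tree a)             -- UNPACK here would be ignored: Tree a is a sum type
         a
         !(Tree a)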
Inlining update shaves off more runtime:
./A +RTS -s -H 1.84s user 0.04s system 99% cpu 1.883 total
as does inlining height:
./A +RTS -s -H 1.74s user 0.03s system 99% cpu 1.777 total
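These inlinings can also be requested explicitly rather than relying on GHC's heuristics or on manual editing; a sketch with guessed definitions (the real update and height in the paste may look different):

module InlineDemo where

-- Sketch with guessed definitions: a cut-down node that caches its height,
-- so the pragmas have something concrete to attach to.
data Tree a = Leaf | Node !Int !Int (Tree a) a (Tree a)  -- priority, cached height

-- An INLINE pragma asks GHC to inline the function at every call site, so
-- small helpers like these disappear into their callers after optimisation.
{-# INLINE height #-}
height :: Tree a -> Int
height Leaf             = 0
height (Node _ h _ _ _) = h

{-# INLINE update #-}
update :: Tree a -> Tree a   -- hypothetical: refresh the cached height of a rebuilt node
update Leaf             = Leaf
update (Node p _ l x r) = Node p (1 + max (height l) (height r)) l x r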
So while it is fast, GC is still dominating -- since we're testing allocation, after all. One thing we can do is increase the allocation area of the first generation (the -A flag):
$ time ./A +RTS -s -A200M
%GC time 45.1% (40.5% elapsed)
./A +RTS -s -A200M 0.71s user 0.16s system 99% cpu 0.872 total
And increasing the unfolding threshold (GHC's -funfolding-use-threshold flag), as JohnL suggested, helps a little:
./A +RTS -s -A100M 0.74s user 0.09s system 99% cpu 0.826 total
which is what, 10x faster than we started? Not bad.
Using ghc-gc-tune, you can see runtime as a function of -A and -H.
Interestingly, the best running times use very large -A values, e.g.:
$ time ./A +RTS -A500M
./A +RTS -A500M 0.49s user 0.28s system 99% cpu 0.776s
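Once a good setting like this is known, it can be baked into the executable with GHC's -with-rtsopts link-time flag so it no longer has to be passed by hand; a sketch, assuming a standalone Main.hs standing in for the real program:

-- Sketch only: Main.hs is a stand-in for the real program.  Compile with
--
--   ghc -O2 -rtsopts "-with-rtsopts=-A200M" Main.hs
--
-- and the executable will default to a 200M allocation area without any
-- +RTS flags on the command line.
module Main (main) where

main :: IO ()
main = print (sum [1 :: Int .. 1000000])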