Why is replicateM (length xs) m way more efficient than sequenceA (fmap (const m) xs)?

Tags:

My two submissions for a programming problem differ in just one expression (where anchors is a nonempty list and (getIntegrals n) is a state monad):

Submission 1. replicateM (length anchors - 1) (getIntegrals n)

Submission 2. sequenceA $ const (getIntegrals n) <$> tail anchors

The two expressions' equivalence should be easy to see at compile time itself, I guess. And yet, comparatively the sequenceA one is slower, and more importantly, takes up >10x memory:

Code	Time	Memory
replicateM one	732 ms	22200 KB
sequenceA one	1435 ms	262100 KB

^{(with "Memory limit exceeded on test 4" error for the second entry, so it might be even worse).}

Why is it so?

It is becoming quite hard to predict which optimizations are automatic and which are not!

EDIT: As suggested, pasting Submission 1 code below. In this interactive problem, the 'server' has a hidden tree of size n. Our code's job is to find out that tree, with minimal number of queries of the form ? k. Loosely speaking, the server's response to ? k is the row corresponding to node k in the ~~adjacency~~ distance matrix of the tree. Our choices of k are: initially 1, and then a bunch of nodes obtained from getAnchors.

{-# LANGUAGE Safe #-}
{-# OPTIONS_GHC -O2 #-}

import Data.Maybe
import qualified Data.ByteString.Lazy.Char8 as B8
import qualified Data.ByteString.Builder as Bu
import Data.Functor.Identity
import Control.Monad.Trans.State
import Control.Monad
import Control.Applicative
import Data.ByteString.Builder.Extra (flush) 
import System.IO

type St = StateT [B8.ByteString] Identity

solve :: St Bu.Builder
solve = do
  n <- getIntegral
  ds <- getIntegrals n  -- get the first row of adjacency matrix
  let
    anchors = getAnchors ds
    readFirst = if head anchors==1 then return ds else getIntegrals n
    readRest = replicateM (length anchors - 1) (getIntegrals n) -- get some other rows too
  adjss <- liftA2 (:) readFirst readRest  
  let
    adj1ss = [map snd $ filter ((1==).fst) (zip adjs [1..]) | adjs <- adjss]
    s0 = Bu.string7
    snl = Bu.string7 "\n" <> flush
    i0 = Bu.intDec
    printEdge src dst = i0 src <> s0 " " <> i0 dst <> snl
    printAdj (src,dsts) = mconcat [printEdge src dst | dst<-dsts]
    printAdjs = mconcat $ printAdj <$> zip anchors adj1ss
    ask k = s0 "? " <> i0 k <> snl
    askRest = mconcat $ ask <$> (dropWhile (==1) anchors)
  return $ ask 1 <> askRest <> s0 "!" <> snl <> printAdjs

getAnchors :: [Int]->[Int]
getAnchors xs = reverse $ go (zip xs [1..]) [] [] where
  go [] odds evens = if length odds < length evens then odds else evens
  go ((k,i):rest) odds evens
    | even k = go rest odds (i: evens)
    | odd k = go rest (i: odds) evens
 
getByteString :: St B8.ByteString
getByteString = state getNext where
  getNext [] =  (B8.take 0 (B8.pack "."),[])
  getNext (w:ws) =  (w,ws)
 
getIntegral :: Num t => St t
getIntegral  = convertToNum <$> getByteString where
  convertToNum x =  fromIntegral $ fromMaybe 0 $ liftA fst $ B8.readInteger x
 
getIntegrals :: Num t => Int -> St [t]
getIntegrals n = replicateM n getIntegral

main :: IO ()
main = do
  hSetBuffering stdout NoBuffering
  bytestrings <- B8.words <$> B8.getContents
  B8.putStr $ Bu.toLazyByteString $ evalState solve bytestrings

700

asked Nov 08 '21 13:11

cobra

1 Answers

The problem here is related to inlining. I do not understand it completly, but here is what I understand.

Inlining

First we find that copy&pasting the definition of replicateM into the Submission 1 yields the same bad performance as Submission 2 (submission). However if we replace the INLINABLE pragma of replicateM with a NOINLINE pragma things work again (submission).

The INLINABLE pragma on replicateM is different from an INLINE pragma, the latter leading to more inlining than the former. Specifically here if we define replicateM in the same file Haskells heuristic for inlining decides to inline, but with replicateM from base it decides against inlining in this case even in the presence of the INLINABLE pragma.

sequenceA and traverse on the other hand both have INLINE pragmas leading to inlining. Taking a hint from the above experiment we can define a non-inlinable sequenceA and indead this makes Solution 2 work (submission).

{-# NOINLINE sequenceA' #-}
sequenceA' :: [St x] -> St [x]
sequenceA' = sequenceA

What is going wrong?

The following is some pretty severe speculation on my part.

But how does inlining cause problems? Well let's look at the difference between the following two core dumps

With inlining:

Without inlining:

Here we're looking both times at what corresponds to, in the first instance the inlined part and the second instance the actual call to replicateM.

readRest = replicateM (length anchors - 1) (getIntegrals n)

Now the interesting bit is that in the inlined code the yellow highlighted lines are run in every loop of replicateM, while in the non-inlined part they are calculated once, outside the lambda abstraction which is passed to replicateM.

But what are they doing? There are multiple variables called ds in the core, but this one refers to this:

enter image description here

which in turn corresponds to

solve = do
  n <- getIntegral

So what I think is happening is that instead of running getIntegral once and saving the result, it's starting state is saved and it is rerun with this state for every pass of the loop. Indeed changing this line to the following (requires BangPatterns language extension) fixes all versions (submission).

solve = do
  !n <- getIntegral

I'm still not really sure, but this is my best guess.

Here are the two core dumps for reference: Inline, Noinline

This is crazy

Well yes, but I feel that the underlying problem here is your use of lazy IO and lazy State. Using the strict State transformer Haskell probably would have figured out to not keep old state around (I have no idea, just a guess), however we can not use strict State here, because of your reliance on lazy IO, i.e. getting all the input at the beginning using getContents and lazyly forcing it while making sure to provide output before forcing too much. Instead it would be a lot safer to explicitely read the input line by line. I.e. replace the StateT [ByteString] with IO or something more fancy like a Conduit or Pipe.

128

answered Oct 12 '22 13:10

Julia Path

Related questions
                            
                                Manual indexing over jagged arrays?
                            
                                Java TCP/IP Socket write performance optimization
                            
                                Oracle performance: query executing multiple identical function calls
                            
                                Matlab's bsxfun() - what explains the performance differences when expanding along different dimensions?
                            
                                Objective-C method swizzling performance
                            
                                Using valgrind to measure cache misses [closed]
                            
                                Julia: optimize simulation of simple dynamical system
                            
                                AdView slows down whole app, possible reason
                            
                                Workaround for the changed jdk6 to jdk7 behavior of substring
                            
                                Shortest Path between two Trie Nodes
                            
                                What are the performance characteristics between curried, partially applied, and 'normal' functions in Scala?
                            
                                Why is Ruby's loop command slower than while true?
                            
                                Jump back some iterations for vectorized remainder loop
                            
                                Garbage Collector not freeing "trash memory" as it should in an Android application
                            
                                Are write-combining buffers used for normal writes to WB memory regions on Intel?
                            
                                How to get test initialize in memory and use in each test
                            
                                Execute Subquery refactoring first before any other SQL
                            
                                fastest way to convert two-bit number to low-memory representation
                            
                                Speeding up data insertion from pandas dataframe to mysql
                            
                                Is there a way to further improve sparse solution times using python?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why is replicateM (length xs) m way more efficient than sequenceA (fmap (const m) xs)?

Tags:

performance

optimization

memory

haskell

monads