Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Retain Laziness With mapM

Tags:

haskell

monads

consider the following simple IO function:

req :: IO [Integer]
req = do
  print "x"
  return [1,2,3]

In reality this might be a http request, which returns a list after parsing it's result.

I'm trying to concatenate the results of several calls of that function in a lazy way.

In simple terms, the following should print the 'x' only two times:

fmap (take 4) req'
--> [1, 2, 3, 4]

I thought this might be solved with sequence or mapM, however my approach fails in terms of laziness:

import Control.Monad

req' :: IO [Integer]
req' = fmap concat $ mapM req [1..1000] -- should be infinite..

This yields the right result, however the IO function req is called 1000 times instead of the necessary 2 times. When implementing the above with a map over an infinite list, the evaluation does not terminate at all.

like image 287
Anton Harald Avatar asked Jan 23 '17 18:01

Anton Harald


3 Answers

Short version:

You shouldn't do this, look into a streaming IO library such as pipes or conduit instead.

Long version:

You can't. Or at least, you really shouldn't. Allowing lazily evaluated code to have side effects is generally a very bad idea. Not only does it very quickly become hard to reason about wich effects are performed when and how many times, but even worse, effects may not be performed in the order you expect them to! With pure code, this is not a big deal. With side-effecting code, this is a disaster.

Imagine that you want to read a value from a reference and then replace the value with an updated value. In the IO monad, where the order of computation is well defined, this is easy:

main = do
  yesterdaysDate <- readIORef ref
  writeIORef ref todaysDate

However, if the above code were instead to be evaluated lazily, there would be no guarantee that the reference was read before it was written - or even that both computations would be executed at all. The semantics of the program would depend entirely on if and when we needed the results of the computations. This is one of the reasons for coming up with monads in the first place: to give programmers a way to write code with side effects, which execute in a well-defined and easily understood order.

Now, it is actually possible to lazily concatenate the lists, if you create them using unsafeInterleaveIO:

import System.IO.Unsafe

req :: IO [Integer]
req = unsafeInterleaveIO $ do
    print "x"
    return [1,2,3]

req' :: IO [Integer]
req' = fmap concat $ mapM (const req) [1..1000]

This will cause each application of req to be deferred until the corresponding sublist is needed. However, lazily performing IO like this may lead to interesting race conditions and resource leaks, and is generally frowned upon. The recommended alternative would be to use a streaming IO library such as conduit or pipes, which are mentioned in the comments.

like image 61
valderman Avatar answered Nov 07 '22 02:11

valderman


Here is how you would do something like this with the streaming and pipes libraries. Pipes programs will be somewhat similar those written with conduit especially in this sort of case. conduit uses different names, and pipes and conduit have somewhat fancier types and operators than streaming; but it's really a matter of indifference which you use. streaming is I think fundamentally simpler in this sort of case; the formulation will be structurally similar to the corresponding IO [a] program and indeed frequently simpler. The essential point is that a Stream (Of Integer) IO () is exactly like list of Integers but it is built in that the elements of the list or stream can arise from successive IO actions.

I gave req an argument in the following, since that seemed to be what you had in mind.

import Streaming
import qualified Streaming.Prelude as S
import Streaming.Prelude (for, each)

req :: Integer -> Stream (Of Integer) IO ()
req x = do                   -- this 'stream' is just a list of Integers arising in IO
  liftIO $ putStr "Sending request #" >> print x
  each [x..x+2]               

req' :: Stream (Of Integer) IO ()
req' = for (S.each [1..]) req -- An infinite succession of requests 
                              -- each yielding three numbers. Here we are not 
                              -- actually using IO to get each but we could.
main = S.print $ S.take 4 req'
-- >>> main
-- Sending request #1
-- 1
-- 2
-- 3
-- Sending request #2
-- 2

To get our four desired values we had to send two "requests"; we of course don't end up applying req to all Integers! S.take doesn't permit any further development of the infinite stream req' it takes as argument; so only the first element from the second request is ever calculated. Then everything shuts down. The fancy signature Stream (Of Int) IO () could be replaced by a synonymn

 type List a = Stream (Of a) IO ()

and you would barely notice the difference from Haskell lists, except you don't get the apocalypses you noticed. The extra moveable parts in the actual signature are distracting here, but make it possible to replicate the whole API of Data.List in basically every detail while permitting IO and avoidance of accumulation everywhere. (Without the further moveable parts it is e.g. impossible to write splitAt, partition and chunksOf, and indeed you will find stack overflow is awash with questions how to do these obvious things with e.g. conduit.)

The pipes equivalent is this

import Pipes
import qualified Pipes.Prelude as P

req :: Integer -> Producer Integer IO ()
req x = do
  liftIO $ putStr "Sending request #" >> print x
  each [x..x+2]

req' = for (each [1..]) req

main = runEffect $ req' >-> P.take 4 >->  P.print 
-- >>> main
-- Sending request #1
-- 1
-- 2
-- 3
-- Sending request #2
-- 2

it differs by treating take and print as pipes, rather than as ordinary functions on streams as they are with Data.List. This has charm but is not needed in the present context where the conception of the stream as an effectful list predominates. Intuitively takeing and printing are things we do to a list, even if it is an effectful list as in this case, and the piping and conduiting aspect is a distraction (in bread-and-butter cases it also nearly doubles the time needed for calculation, due to the cost of >-> and .| which is akin to that of say map.)

It might help understanding if we note that req above could have been written

req x = do
  liftIO $ putStr "Sending request #" >> print x
  yield x      -- yield a >> yield b == each [a,b]
  yield (x+1)
  yield (x+2)  

this will be word for word the same in streaming pipes and conduit. yield a >> rest is the same as a:rest The difference is that a yield a line (in a do block) can be preceded by a bit of IO, e.g. a <- liftIO readLn; yield a

In general list mapM replicateM traverse and sequence should be avoided - except for short lists - for the reasons you mention. sequence is at the bottom of them all and it basically has to constitute the whole list before it can proceed. (Note sequence = mapM id; mapM f = sequence . map f) Thus we see

 >>> sequence [getChar,getChar,getChar] >>= mapM_ print
 abc'a'   -- here and below I just type abc, ghci prints 'a' 'b' 'c'
 'b'
 'c'

but with a streaming library we see stuff like

 >>> S.mapM_ print $ S.sequence $ S.each [getChar,getChar,getChar] 
 a'a'
 b'b'
 c'c'

Similarly

>>> replicateM 3 getChar >>= mapM_ print
abc'a'
'b'
'c'

is a mess - nothing happens till the whole list is constructed, then each of the collected Chars is printed in succession. But with a streaming library we write the simpler

>>> S.mapM_ print $ S.replicateM 3 getChar
a'a'
b'b'
c'c'

and the outputs are in sync with the inputs. In particular, no more than one character is in memory at a time. replicateM_, mapM_ and sequence_ by contrast don't accumulate lists aren't a problem. It's the others that should prompt one to think of a streaming library, any streaming library. A monad-general sequence can't do any better than this, as you can see by reflecting on

>>> sequence [Just 1, Just 2, Just 3]
Just [1,2,3]
>>> sequence [Just 1, Just 2, Nothing]
Nothing

If the list were a million Maybe Ints long, it would all have to be remembered and left unused while waiting to see if last item is Nothing. Since sequence, mapM, replicateM, traverse and company are monad general, what goes for Maybe goes for IO.

Continuing above, we can similarly collect the list as you seemed to want to do:

main = S.toList_ (S.take 4 req') >>= print 
-- >>> main
-- Sending request #1
-- Sending request #2
-- [1,2,3,2]

or, in the pipes version:

main = P.toListM (req' >->  P.take 4) >>= print  
-- >>> main
-- Sending request #1
-- Sending request #2
-- [1,2,3,2]

Or to pile on possibilities, we can do IO with each element, while collecting them in a list or vector or whatever

main = do
  ls <- S.toList_ $ S.print $ S.copy $ S.take 4 req'
  print ls

-- >>> main
-- Sending request #1
-- 1
-- 2
-- 3
-- Sending request #2
-- 2
-- [1,2,3,2]

Here I print the copies and save the 'originals' for a list. The games we are playing here start to come upon the limits of pipes and conduit, though this particular program can be replicated with them.

like image 45
Michael Avatar answered Nov 07 '22 01:11

Michael


As far as I know, what you're looking for shouldn't/can't be done using mapM and should probably use some form of streaming. In case it's helpful, an example using io-streams:

import qualified System.IO.Streams as Streams
import qualified System.IO.Streams.Combinators as Streams

req :: IO (Maybe [Integer])
req = do
  print "x"
  return (Just [1,2,3])

req' :: IO [Integer]
req' = Streams.toList =<< Streams.take 4 =<< Streams.concatLists =<< Streams.makeInputStream req
like image 1
ryachza Avatar answered Nov 07 '22 01:11

ryachza