Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Space explosion when folding over Producers/Parsers in Haskell

Supposing I have a module like this:

module Explosion where

import Pipes.Parse (foldAll, Parser, Producer)
import Pipes.ByteString (ByteString, fromLazy)
import Pipes.Aeson (DecodingError)
import Pipes.Aeson.Unchecked (decoded)
import Data.List (intercalate)
import Data.ByteString.Lazy.Char8 (pack)
import Lens.Family (view)
import Lens.Family.State.Strict (zoom)

produceString :: Producer ByteString IO ()
produceString = fromLazy $ pack $ intercalate " " $ map show [1..1000000]

produceInts :: 
    Producer Int IO (Either (DecodingError, Producer ByteString IO ()) ())
produceInts = view decoded produceString

produceInts' :: Producer Int IO ()
produceInts' = produceInts >> return ()

parseBiggest :: Parser ByteString IO Int
parseBiggest = zoom decoded (foldAll max 0 id)

The 'produceString' function is a bytestring producer, and I am concerned with folding a parse over it to produce some kind of result.

The following two programs show different ways of tackling the problem of finding the maximum value in the bytestring by parsing it as a series of JSON ints.

Program 1:

module Main where

import Explosion (produceInts')
import Pipes.Prelude (fold)

main :: IO ()
main = do
    biggest <- fold max 0 id produceInts'
    print $ show biggest

Program 2:

module Main where

import Explosion (parseBiggest, produceString)
import Pipes.Parse (evalStateT)

main :: IO ()
main = do
    biggest <- evalStateT parseBiggest produceString
    print $ show biggest

Unfortunately, both programs eat about 200MB of memory total when I profile them, a problem I'd hoped the use of streaming parsers would solve. The first program spends most of its time and memory (> 70%) in (^.) from Lens.Family, while the second spends it in fmap, called by zoom from Lens.Family.State.Strict. The usage graphs are below. Both programs spend about 70% of their time doing garbage collection.

Am I doing something wrong? Is the Prelude function max not strict enough? I can't tell if the library functions are bad, or if I'm using the library wrong! (It's probably the latter.)

For completeness, here's a git repo that you can clone and run cabal install in if you'd like to see what I'm talking about first-hand, and here's the memory usage of the two programs:

memory usage of program 1memory usage of program 2

like image 468
immutablestate Avatar asked May 19 '14 13:05

immutablestate


1 Answers

Wrapping a strict bytestring in a single yield doesn't make it lazy. You have to yield smaller chunks to get any streaming behavior.

Edit: I found the error. pipes-aeson internally uses a consecutively function defined like this:

consecutively parser = step where
    step p0 = do
      (mr, p1) <- lift $
         S.runStateT atEndOfBytes (p0 >-> PB.dropWhile B.isSpaceWord8)
      case mr of
         Just r  -> return (Right r)
         Nothing -> do
            (ea, p2) <- lift (S.runStateT parser p1)
            case ea of
               Left  e -> return (Left (e, p2))
               Right a -> yield a >> step p2

The problematic line is the one with PB.dropWhile. This adds a quadratic blow up proportional to the number of parsed elements.

What happens is that the pipe that is threaded through this computation accumulates a new cat pipe downstream of it after each parse. So after N parses you get N cat pipes, which adds O(N) overhead to each parsed element.

I've created a Github issue to fix this. pipes-aeson is maintained by Renzo and he has fixed this issue before.

Edit: I've submitted a pull request to fix a second problem (you needed to use the intercalate for lazy bytestrings). Now the program runs in 5 KB constant space for both versions:

enter image description here

enter image description here

like image 103
Gabriella Gonzalez Avatar answered Nov 05 '22 13:11

Gabriella Gonzalez