Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell Lazy ByteString + read/write progress function

Tags:

I am learing Haskell Lazy IO.

I am looking for an elegant way to copy a large file (8Gb) while printing copy progress to console.

Consider the following simple program that copies a file silently.

module Main where  import System import qualified Data.ByteString.Lazy as B  main = do [from, to] <- getArgs           body <- B.readFile from           B.writeFile to body 

Imgine there is a callback function you want to use for reporting:

onReadBytes :: Integer -> IO () onReadBytes count = putStrLn $ "Bytes read: " ++ (show count) 

QUESTION: how to weave onReadBytes function into Lazy ByteString so it will be called back on successfull read? Or if this design is not good, then what is the Haskell way to do it?

NOTE: the frequency of callback is not important, it can be called every 1024 bytes or every 1 Mb -- not important

ANSWER: Many thanks to camccann for the answer. I suggest to read it entirely.

Bellow is my version of the code based on camccann's code, you may find it useful.

module Main where  import System import System.IO import qualified Data.ByteString.Lazy as B  main = do [from, to] <- getArgs           withFile from ReadMode $ \fromH ->             withFile to WriteMode $ \toH ->               copyH fromH toH $ \x -> putStrLn $ "Bytes copied: " ++ show x  copyH :: Handle -> Handle -> (Integer -> IO()) -> IO () copyH fromH toH onProgress =     copy (B.hGet fromH (256 * 1024)) (write toH) B.null onProgress     where write o x  = do B.hPut o x                           return . fromIntegral $ B.length x  copy :: (Monad m) => m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m()) -> m() copy = copy_ 0  copy_ :: (Monad m) => Integer -> m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m()) -> m() copy_ count inp outp done onProgress = do x <- inp                                           unless (done x) $                                             do n <- outp x                                                onProgress (n + count)                                                copy_ (n + count) inp outp done onProgress 
like image 602
oshyshko Avatar asked Jul 12 '11 17:07

oshyshko


2 Answers

First, I'd like to note that a fair number of Haskell programmers regard lazy IO in general with some suspicion. It technically violates purity, but in a limited way that (as far as I'm aware) isn't noticeable when running a single program on consistent input[0]. On the other hand, plenty of people are fine with it, again because it involves only a very restricted kind of impurity.

To create the illusion of a lazy data structure that's actually created with on-demand I/O, functions like readFile are implemented using sneaky shenanigans behind the scenes. Weaving in the on-demand I/O is inherent to the function, and it's not really extensible for pretty much the same reasons that the illusion of getting a regular ByteString from it is convincing.

Handwaving the details and writing pseudocode, something like readFile basically works like this:

lazyInput inp = lazyIO (lazyInput' inp) lazyInput' inp = do x <- readFrom inp                     if (endOfInput inp)                         then return []                         else do xs <- lazyInput inp                                 return (x:xs) 

...where each time lazyIO is called, it defers the I/O until the value is actually used. To invoke your reporting function each time the actual read occurs, you'd need to weave it in directly, and while a generalized version of such a function could be written, to my knowledge none exist.

Given the above, you have a few options:

  • Look up the implementation of the lazy I/O functions you're using, and implement your own that include the progress reporting function. If this feels like a dirty hack, that's because it pretty much is, but there you go.

  • Abandon lazy I/O and switch to something more explicit and composable. This is the direction that the Haskell community as a whole seems to be heading in, specifically toward some variation on Iteratees, which give you nicely composable little stream processor building blocks that have more predictable behavior. The downside is that the concept is still under active development so there's no consensus on implementation or single starting point for learning to use them.

  • Abandon lazy I/O and switch to plain old regular I/O: Write an IO action that reads a chunk, prints the reporting info, and processes as much input as it can; then invoke it in a loop until done. Depending on what you're doing with the input and how much you're relying on laziness in your processing, this could involve anything from writing a couple nearly-trivial functions to building a bunch of finite-state-machine stream processors and getting 90% of the way to reinventing Iteratees.

[0]: The underlying function here is called unsafeInterleaveIO, and to the best of my knowledge the only ways to observe impurity from it require either running the program on different input (in which case it's entitled to behave differently anyhow, it just may be doing so in ways that don't make sense in pure code), or changing the code in certain ways (i.e., refactorings that should have no effect can have non-local effects).


Here's a rough example of doing things the "plain old regular I/O" way, using more composable functions:

import System import System.IO import qualified Data.ByteString.Lazy as B  main = do [from, to] <- getArgs           -- withFile closes the handle for us after the action completes           withFile from ReadMode $ \inH ->             withFile to WriteMode $ \outH ->                 -- run the loop with the appropriate actions                 runloop (B.hGet inH 128) (processBytes outH) B.null  -- note the very generic type; this is useful, because it proves that the -- runloop function can only execute what it's given, not do anything else -- behind our backs. runloop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m () runloop inp outp done = do x <- inp                            if done x                              then return ()                              else do outp x                                      runloop inp outp done  -- write the output and report progress to stdout. note that this can be easily -- modified, or composed with other output functions. processBytes :: Handle -> B.ByteString -> IO () processBytes h bs | B.null bs = return ()                   | otherwise = do onReadBytes (fromIntegral $ B.length bs)                                    B.hPut h bs  onReadBytes :: Integer -> IO () onReadBytes count = putStrLn $ "Bytes read: " ++ (show count) 

The "128" up there is how many bytes to read at a time. Running this on a random source file in my "Stack Overflow snippets" directory:

$ runhaskell ReadBStr.hs Corec.hs temp Bytes read: 128 Bytes read: 128 Bytes read: 128 Bytes read: 128 Bytes read: 128 Bytes read: 128 Bytes read: 128 Bytes read: 128 Bytes read: 128 Bytes read: 128 Bytes read: 83 $ 
like image 151
C. A. McCann Avatar answered Sep 30 '22 19:09

C. A. McCann


Use Data.ByteString.Lazy.Progress. It allows you to print all kinds of metrics as data passes through.

like image 33
Matt Joiner Avatar answered Sep 30 '22 19:09

Matt Joiner