
Haskell iteratee: simple worked example of stripping trailing whitespace

I'm trying to understand how to use the iteratee library with Haskell. All of the articles I've seen so far seem to focus on building an intuition for how iteratees could be built, which is helpful, but now that I want to get down and actually use them, I feel a bit at sea. Looking at the source code for iteratees has been of limited value for me.

Let's say I have this function which trims trailing whitespace from a line:

import Data.ByteString.Char8
import Data.Char (isSpace)

rstrip :: ByteString -> ByteString
rstrip = fst . spanEnd isSpace

What I'd like to do is: make this into an iteratee, read a file and write it out somewhere else with the trailing whitespace stripped from each line. How would I go about structuring that with iteratees? I see there's an enumLinesBS function in Data.Iteratee.Char which I could plumb into this, but I don't know if I should use mapChunks or convStream or how to repackage the function above into an iteratee.

Daniel Lyons asked Jul 10 '11 20:07



1 Answer

If you just want code, it's this:

procFile' iFile oFile = fileDriver (joinI $
   enumLinesBS ><>
   mapChunks (map rstrip) $
   I.mapM_ (B.appendFile oFile))
   iFile

Commentary:

This is a three-stage process: first you transform the raw stream into a stream of lines, then you apply your function to convert that stream of lines, and finally you consume the stream. Since rstrip belongs to the middle stage, it becomes a stream transformer (an Enumeratee).
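As a point of comparison, the same three stages can be modelled without iteratees at all. This is only a sketch: it reads the whole file into memory, which is exactly what the iteratee version avoids, and the names stripAll and processFileStrict are made up here for illustration.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (isSpace)

-- The function from the question, with the import qualified as B.
rstrip :: B.ByteString -> B.ByteString
rstrip = fst . B.spanEnd isSpace

-- Stage 1: split into lines (B.lines).
-- Stage 2: transform each line (map rstrip).
-- Stage 3: put the result back together (B.unlines adds a newline per line).
stripAll :: B.ByteString -> B.ByteString
stripAll = B.unlines . map rstrip . B.lines

-- Hypothetical strict driver: loads the entire input before writing anything.
processFileStrict :: FilePath -> FilePath -> IO ()
processFileStrict iFile oFile = B.readFile iFile >>= B.writeFile oFile . stripAll
```

The iteratee pipeline has the same shape, except that each stage works on whatever chunk is currently available rather than on the whole file.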

You can use either mapChunks or convStream, but mapChunks is simpler. The difference is that mapChunks doesn't let you work across chunk boundaries, whereas convStream is more general. I prefer convStream because it doesn't expose any of the underlying implementation, but when mapChunks is sufficient the resulting code is usually shorter.

rstripE :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE = mapChunks (map rstrip)

Note the extra map in rstripE. The outer stream (which is the input to rstripE) has type [ByteString], so we need to map rstrip over it.

For comparison, this is what it would look like if implemented with convStream:

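rstripE' :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE' = convStream $ do
  mLine <- I.peek
  -- convStream expects the inner iteratee to return a chunk of the output
  -- stream (here a list of lines), so the result is wrapped in a list
  maybe (return []) (\line -> I.drop 1 >> return [rstrip line]) mLine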

This is longer, and it's less efficient because it only applies rstrip to one line at a time, even though more lines may be available. It's possible to work on the whole of the currently available chunk, which is closer to the mapChunks version:

rstripE'2 :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE'2 = convStream (liftM (map rstrip) getChunk)

Anyway, with the stripping enumeratee available, it's easily composed with the enumLinesBS enumeratee:

enumStripLines :: Monad m => Enumeratee ByteString [ByteString] m a
enumStripLines = enumLinesBS ><> rstripE

The composition operator ><> follows the same order as the arrow operator >>>. enumLinesBS splits the stream into lines, then rstripE strips them. Now you just need to add a consumer (which is a normal iteratee), and you're done:

writer :: FilePath -> Iteratee [ByteString] IO ()
writer fp = I.mapM_ (B.appendFile fp)

processFile iFile oFile =
  enumFile defaultBufSize iFile (joinI $ enumStripLines $ writer oFile) >>= run

The fileDriver functions are shortcuts for simply enumerating over a file and running the resulting iteratee (unfortunately the argument order is switched from enumFile):

procFile2 iFile oFile = fileDriver (joinI $ enumStripLines $ writer oFile) iFile

Addendum: here's a situation where you would need the extra power of convStream. Suppose you want to concatenate every 2 lines into one. You can't use mapChunks. Consider the case where the chunk is a single-element list, [bytestring]. mapChunks doesn't provide any way to access the next chunk, so there's nothing to concatenate it with. With convStream, however, it's simple:

concatPairs = convStream $ do
  line1 <- I.head
  line2 <- I.head
  return $ line1 `B.append` line2

This looks even nicer in applicative style:

convStream $ B.append <$> I.head <*> I.head
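To make the chunk-boundary problem concrete, here is a small pure-list model. It does not use the iteratee library: a stream is modelled as a list of chunks, perChunk stands in for a mapChunks-style transformation, and acrossChunks stands in for a convStream-style one. All of the names are made up for this sketch, and keeping an unpaired leftover line is a choice of the model, not a claim about how I.head behaves at end of stream.

```haskell
-- A "stream" modelled as a list of chunks, each chunk being a list of lines.
type Chunked = [[String]]

-- Join neighbouring lines two at a time; an unpaired last line is kept as-is.
pairUp :: [String] -> [String]
pairUp (a : b : rest) = (a ++ b) : pairUp rest
pairUp leftover       = leftover

-- mapChunks-like: every chunk is processed in isolation, so a line at the
-- end of one chunk can never be joined with the first line of the next.
perChunk :: Chunked -> [String]
perChunk = concatMap pairUp

-- convStream-like: the consumer sees one continuous stream of lines and
-- chunk boundaries are invisible.
acrossChunks :: Chunked -> [String]
acrossChunks = pairUp . concat
```

With the input [["a"],["b","c"],["d"]], perChunk gives ["a","bc","d"] while acrossChunks gives ["ab","cd"], which is the behaviour you actually want.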

You can think of convStream as continually consuming a portion of the stream with the provided iteratee, then sending the transformed version to the inner consumer. Sometimes even this isn't general enough, since the same iteratee is called at each step. In that case, you can use unfoldConvStream to pass state between successive iterations.

convStream and unfoldConvStream also allow for monadic actions, since the stream processing iteratee is a monad transformer.
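The idea of passing state between iterations can also be sketched on plain lists. This is a simplified model, not the library's unfoldConvStream: the step function here receives the current state, the next element and the remaining input, and emits exactly one output element per step. unfoldConvModel and numberLines are hypothetical names used only for this illustration.

```haskell
-- Pure-list model of a stateful stream conversion: run the same step
-- function repeatedly, threading a state value from one step to the next.
unfoldConvModel :: (s -> a -> [a] -> (s, b, [a])) -> s -> [a] -> [b]
unfoldConvModel _ _ [] = []
unfoldConvModel step st (x : xs) =
  let (st', y, rest) = step st x xs
  in  y : unfoldConvModel step st' rest

-- Example: prefix each line with its line number. The counter is the state
-- that a stateless, convStream-style step would have no way to carry along.
numberLines :: [String] -> [String]
numberLines = unfoldConvModel step (1 :: Int)
  where
    step n line rest = (n + 1, show n ++ ": " ++ line, rest)
```

For example, numberLines ["a","b","c"] evaluates to ["1: a","2: b","3: c"].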

John L answered Nov 09 '22 11:11
