Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ByteString concatMap performance

I have a 37MB bin file I am trying to convert to a ppm sequence. It works fine, and I'm trying to use this as an exercise to learn some profiling and more about lazy bytestrings in Haskell. My program seems to bomb at the concatMap, which is used to replicate each byte three times so I have R, G, and B. The code is fairly straight forward - every 2048 bytes I write a new header:

{-# LANGUAGE OverloadedStrings #-}

import System.IO
import System.Environment
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as B


main :: IO ()
main = do [from, to] <- getArgs
          withFile from ReadMode $ \inH ->
            withFile to WriteMode $ \outH ->
                loop (B.hGet inH 2048) (process outH) B.null


loop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
loop inp outp done = inp >>= \x -> unless (done x) (outp x >> loop inp outp done)


process :: Handle -> B.ByteString -> IO ()
process h bs | B.null bs = return ()
             | otherwise = B.hPut h header >> B.hPut h bs'
                           where header = "P6\n32 64\n255\n" :: B.ByteString
                                 bs'    = B.concatMap (B.replicate 3) bs

This pulls it off in a little over 5s. It's not terrible, and my only comparison is my very naive C implementation that does it a little under 4s - so that or ideally under has been my goal.

Here is the RTS from the above code:

  33,435,345,688 bytes allocated in the heap
      14,963,640 bytes copied during GC
          54,640 bytes maximum residency (77 sample(s))
          21,136 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     64604 colls,     0 par    0.20s    0.25s     0.0000s    0.0001s
  Gen  1        77 colls,     0 par    0.00s    0.01s     0.0001s    0.0006s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    5.09s  (  5.27s elapsed)
  GC      time    0.21s  (  0.26s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    5.29s  (  5.52s elapsed)

  %GC     time       3.9%  (4.6% elapsed)

  Alloc rate    6,574,783,667 bytes per MUT second

  Productivity  96.1% of total user, 92.1% of total elapsed

Pretty gnarly results. When I remove the concatMap and just copy everything over with the headers every 2048 bytes, it's practically instant:

      70,983,992 bytes allocated in the heap
          48,912 bytes copied during GC
          54,640 bytes maximum residency (2 sample(s))
          19,744 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       204 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.01s  (  0.07s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.02s  (  0.07s elapsed)

  %GC     time       9.6%  (2.9% elapsed)

  Alloc rate    5,026,838,892 bytes per MUT second

  Productivity  89.8% of total user, 22.3% of total elapsed

So I guess my question is two fold:

  • How can I improve the overall performance?
  • And had the bottleneck not been so obvious, what are some ways I could have tracked it down?

Thank you.

Edit

Here's the final code and RTS if anyone is interested! I was also able to find additional bottlenecks by making use of ghc's profiler with -prof -auto-all -caf-all after reading up on the Profiling and optimization chapter of Real World Haskell.

{-# LANGUAGE OverloadedStrings #-}

import System.IO
import System.Environment
import Control.Monad
import Data.Monoid
import qualified Data.ByteString.Builder    as BU
import qualified Data.ByteString.Lazy.Char8 as BL


main :: IO ()
main = do [from, to] <- getArgs
          withFile from ReadMode $ \inH ->
              withFile to WriteMode $ \outH ->
                  loop (BL.hGet inH 2048) (process outH) BL.null


loop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
loop inp outp done = inp >>= \x -> unless (done x) (outp x >> loop inp outp done)


upConcatMap :: Monoid c => (Char -> c) -> BL.ByteString -> c
upConcatMap f bs = mconcat . map f $ BL.unpack bs


process :: Handle -> BL.ByteString -> IO ()
process h bs | BL.null bs = return ()
             | otherwise = BU.hPutBuilder h frame
                           where header = "P6\n32 64\n255\n"
                                 bs'    = BU.toLazyByteString $ upConcatMap trip bs
                                 frame  = BU.lazyByteString $ mappend header bs'
                                 trip c = let b = BU.char8 c in mconcat [b, b, b]

6,383,263,640 bytes allocated in the heap
      18,596,984 bytes copied during GC
          54,640 bytes maximum residency (2 sample(s))
          31,056 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     11165 colls,     0 par    0.06s    0.06s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.69s  (  0.83s elapsed)
  GC      time    0.06s  (  0.06s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.75s  (  0.89s elapsed)

  %GC     time       7.4%  (7.2% elapsed)

  Alloc rate    9,194,103,284 bytes per MUT second

  Productivity  92.6% of total user, 78.0% of total elapsed
like image 340
sudochop Avatar asked Oct 21 '22 09:10

sudochop


1 Answers

What about Builder?

This version is ~5x faster for me:

process :: Handle -> B.ByteString -> IO ()
process h bs
  | B.null bs = return ()
  | otherwise = B.hPut h header >> B.hPutBuilder h bs'
  where header = "P6\n32 64\n255\n" :: B.ByteString
        bs'    = mconcat $ map triple $ B.unpack bs 
        triple c = let b = B.char8 c in mconcat [b, b, b]

It allocates much less garbage.

ADD: for the reference, runtime statistics:

   4,642,746,104 bytes allocated in the heap
     390,110,640 bytes copied during GC
          63,592 bytes maximum residency (2 sample(s))
          21,648 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      8992 colls,     0 par    0.54s    0.63s     0.0001s    0.0017s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.98s  (  1.13s elapsed)
  GC      time    0.54s  (  0.63s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.52s  (  1.76s elapsed)

  %GC     time      35.4%  (36.0% elapsed)

  Alloc rate    4,718,237,910 bytes per MUT second

  Productivity  64.6% of total user, 55.9% of total elapsed
like image 136
Yuras Avatar answered Oct 23 '22 00:10

Yuras