
Haskell performance implementing unix's "cat" program with Data.ByteString

I have the following Haskell code, implementing a simple version of the "cat" Unix command-line utility. Testing performance with "time" on a 400MB file, it's about 3x slower than cat. (The exact script I am using to test it is below the code.)

My questions are:

  1. Is this a valid test of performance?
  2. How can I make this program run faster?
  3. How can I identify performance bottlenecks in Haskell programs in general?

Regarding questions 2 and 3: I have compiled with GHC's -prof flag and run with +RTS -p, but I'm finding the output a bit uninformative here.
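For reference, this is the kind of profiling invocation I would expect to give a more detailed report (these are standard GHC flags; -fprof-auto adds a cost centre to every top-level binding, so the report breaks time and allocation down below MAIN; the file names are the ones from this question):

```shell
# Build a profiled binary; -fprof-auto inserts cost centres automatically,
# so the .prof report attributes time/allocation to individual bindings
# instead of lumping everything under MAIN.
if command -v ghc >/dev/null 2>&1; then
  ghc -O2 -prof -fprof-auto Main.hs -o MainProf
  ./MainProf huge +RTS -p -RTS    # writes MainProf.prof next to the binary
fi
```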

Source (Main.hs)

module Main where

import System.IO
import System.Environment
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle buf = do
  hPut stdout buf
  eof <- hIsEOF handle
  unless eof $ do
    buf <- hGetSome handle bufsize
    go handle buf

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- hGetSome handle bufsize
  hSetBuffering stdin $ BlockBuffering (Just bufsize)
  hSetBuffering stdout $ BlockBuffering (Just bufsize)
  go handle buf

Timing script (run.sh):

#!/usr/bin/env bash

# Generate 10M lines of silly test data
yes aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | head -n 10000000 > huge

# Compile with optimisation
ghc -O2 Main.hs

# Run haskell
echo "timing Haskell"
time ./Main huge > /dev/null

echo ""
echo ""

# Run cat
echo "timing 'cat'"
time cat huge > /dev/null

My results:

timing Haskell

real    0m0.980s
user    0m0.296s
sys     0m0.684s


timing 'cat'

real    0m0.304s
user    0m0.001s
sys     0m0.302s

The profiling report when compiling with -prof and running with +RTS -p is below:

  Sat Dec 13 21:26 2014 Time and Allocation Profiling Report  (Final)

     Main +RTS -p -RTS huge

  total time  =        0.92 secs   (922 ticks @ 1000 us, 1 processor)
  total alloc = 7,258,596,176 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0


                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.Conc.Signal           84           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Handle.FD          81           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    76           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           70           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     69           0    0.0    0.0     0.0    0.0
asked Dec 13 '14 by statusfailed

2 Answers

This is only a partial answer trying to address the second question:

I tried something like this using the GHC.IO.Buffer API:

module Main where

import System.IO
import System.Environment
import GHC.IO.Buffer
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle bufPtr = do
  read <- hGetBuf handle bufPtr bufsize
  when (read > 0) $ do
    hPutBuf stdout bufPtr read
    go handle bufPtr

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- newByteBuffer bufsize WriteBuffer
  withBuffer buf $ go handle

and it seems to come closer to the performance of 'cat', but still definitely slower...

time ./Cat huge > /dev/null
./Cat huge > /dev/null  0.00s user 0.06s system 76% cpu 0.081 total

time cat huge > /dev/null
cat huge > /dev/null  0.00s user 0.05s system 75% cpu 0.063 total

I think that with the buffer API we explicitly avoid allocating a fresh ByteString for every read, which hGetSome does in the original code, but I am only guessing here and don't know exactly what is happening in either compiled program...

UPDATE: Adding the original code's performance on my laptop:

time ./Cat2 huge > /dev/null
./Cat2 huge > /dev/null  0.12s user 0.10s system 99% cpu 0.219 total

UPDATE 2: Adding some basic profiling results:

Original Code:

Cat2 +RTS -p -RTS huge

  total time  =        0.21 secs   (211 ticks @ 1000 us, 1 processor)
  total alloc = 6,954,068,112 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0


                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.IO.Handle.FD          86           0    0.0    0.0     0.0    0.0
 CAF        GHC.Conc.Signal           82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           80           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 79           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    75           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     72           0    0.0    0.0     0.0    0.0

Buffer-API Code:

Cat +RTS -p -RTS huge

  total time  =        0.06 secs   (61 ticks @ 1000 us, 1 processor)
  total alloc =   3,487,712 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0   98.9


                                                      individual     inherited
COST CENTRE MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                     44           0  100.0   98.9   100.0  100.0
 CAF        GHC.IO.Handle.FD         85           0    0.0    1.0     0.0    1.0
 CAF        GHC.Conc.Signal          82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding          80           0    0.0    0.1     0.0    0.1
 CAF        GHC.IO.FD                79           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv    71           0    0.0    0.0     0.0    0.0

Notice especially the big difference in allocation costs: roughly 7 GB for the original code versus about 3.5 MB with the buffer API...

answered Nov 14 '22 by bmk


The original question made me think it was about finding a performance issue in the exact code provided. Since the comment "I'm hoping to go for a more idiomatic / "high level" Haskell solution" contradicts that assumption, I'll give a reasonably performing, idiomatic Haskell solution instead.

The way I would expect any programmer familiar with Haskell to solve this problem is with lazy ByteStrings. They let the programmer simply state the task of reading the input and writing it out, while the library worries about the buffering and looping constructs.

module Main where

import System.IO
import System.Environment
import Data.ByteString.Lazy as BS

import Control.Monad

main :: IO ()
main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- BS.hGetContents handle
  hPut stdout buf

The result is both more readable and better performing than the code in the original question:

timing 'cat'

real    0m0.075s
user    0m0.000s
sys     0m0.074s
timing strict bytestring with GHC -O2

real    0m0.254s
user    0m0.126s
sys     0m0.127s
timing strict bytestring with GHC -O2 -fllvm

real    0m0.267s
user    0m0.132s
sys     0m0.134s
timing lazy bytestring with GHC -O2

real    0m0.091s
user    0m0.023s
sys     0m0.067s
timing lazy bytestring with GHC -O2 -fllvm

real    0m0.091s
user    0m0.021s
sys     0m0.069s

That is, the lazy bytestring solution is about 21% slower than cat. Running cat last, so that it benefits from the now-warm cache, brings it down to a 59ms runtime, which places the Haskell solution at about 51% slower.
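One way to make informal comparisons like these a bit fairer (a sketch assuming a Linux machine with coreutils; the file name is illustrative) is to warm the page cache with a throwaway read before timing anything, so that no binary pays the cold-cache penalty of running first:

```shell
# Generate a small demo file, then read it once before timing, so that
# the timed runs are served from the page cache rather than the disk.
yes aaaaaaaa | head -n 1000 > /tmp/warmup-demo
cat /tmp/warmup-demo > /dev/null          # warm-up read, not timed
time cat /tmp/warmup-demo > /dev/null     # timed run, hits the cache
```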

EDIT: Dons suggested that using memory-mapped IO would more accurately model cat's behavior. I'm not sure how accurate that statement is, but mmap almost always results in better performance, and this situation is certainly no exception:

timing memory mapped lazy bytestring with GHC -O2

real    0m0.008s
user    0m0.004s
sys     0m0.003s

Which was produced by:

module Main where

import System.IO (stdout)
import System.Environment
import System.IO.Posix.MMap.Lazy
import Data.ByteString.Lazy (hPut)

import Control.Monad

main :: IO ()
main = do
  file    <- fmap Prelude.head getArgs
  buf     <- unsafeMMapFile file
  hPut stdout buf
answered Nov 14 '22 by Thomas M. DuBuisson