I have the following Haskell code, implementing a simple version of the "cat" Unix command-line utility. Testing performance with "time" on a 400 MB file, it runs about 3x slower than cat. (The exact script I am using to test it is below the code.)
My questions are:
Regarding questions 2 and 3: I have compiled with GHC's -prof flag and run with +RTS -p, but I'm finding the output rather uninformative.
Source (Main.hs)
module Main where

import System.IO
import System.Environment
import Data.ByteString as BS
import Control.Monad

-- Copied from cat source code
bufsize = 1024 * 128

go handle buf = do
    hPut stdout buf
    eof <- hIsEOF handle
    unless eof $ do
        buf <- hGetSome handle bufsize
        go handle buf

main = do
    file <- fmap Prelude.head getArgs
    handle <- openFile file ReadMode
    buf <- hGetSome handle bufsize
    hSetBuffering stdin $ BlockBuffering (Just bufsize)
    hSetBuffering stdout $ BlockBuffering (Just bufsize)
    go handle buf
Timing script (run.sh):
#!/usr/bin/env bash
# Generate 10M lines of silly test data
yes aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | head -n 10000000 > huge
# Compile with optimisation
ghc -O2 Main.hs
# Run haskell
echo "timing Haskell"
time ./Main huge > /dev/null
echo ""
echo ""
# Run cat
echo "timing 'cat'"
time cat huge > /dev/null
My results:
timing Haskell
real 0m0.980s
user 0m0.296s
sys 0m0.684s
timing 'cat'
real 0m0.304s
user 0m0.001s
sys 0m0.302s
The profiling report when compiling with -prof and running with +RTS -p is below:
Sat Dec 13 21:26 2014 Time and Allocation Profiling Report (Final)
Main +RTS -p -RTS huge
total time = 0.92 secs (922 ticks @ 1000 us, 1 processor)
total alloc = 7,258,596,176 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
MAIN MAIN 100.0 100.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 46 0 100.0 100.0 100.0 100.0
CAF GHC.Conc.Signal 84 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 81 0 0.0 0.0 0.0 0.0
CAF System.Posix.Internals 76 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 70 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 69 0 0.0 0.0 0.0 0.0
This is only a partial answer trying to address the second question:
I tried something like this using the GHC.IO.Buffer API:
module Main where

import System.IO
import System.Environment
import GHC.IO.Buffer
import Data.ByteString as BS
import Control.Monad

-- Copied from cat source code
bufsize = 1024 * 128

go handle bufPtr = do
    read <- hGetBuf handle bufPtr bufsize
    when (read > 0) $ do
        hPutBuf stdout bufPtr read
        go handle bufPtr

main = do
    file <- fmap Prelude.head getArgs
    handle <- openFile file ReadMode
    buf <- newByteBuffer bufsize WriteBuffer
    withBuffer buf $ go handle
and it seems to come closer to the performance of 'cat', but still definitely slower...
time ./Cat huge > /dev/null
./Cat huge > /dev/null  0.00s user 0.06s system 76% cpu 0.081 total

time cat huge > /dev/null
cat huge > /dev/null  0.00s user 0.05s system 75% cpu 0.063 total
I think that by using the buffer API we explicitly avoid allocating a fresh buffer ByteString for every chunk, which is what hGetSome does in the original code. But I am just guessing here, and I don't know exactly what is happening in either compiled program.
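If the reusable-buffer idea is the explanation, the same loop can be sketched with plain Foreign allocation from base, with no GHC.IO.Buffer involved. This is only a sketch under that assumption, not a benchmark-verified implementation, and the name copyLoop is mine:

```haskell
module Main where

import Control.Monad (when)
import Foreign.Marshal.Alloc (allocaBytes)
import System.Environment (getArgs)
import System.IO

bufsize :: Int
bufsize = 1024 * 128

-- Copy from one handle to another through a single reusable buffer.
-- hGetBuf returns the number of bytes actually read (0 at EOF), so
-- no intermediate ByteString is allocated per chunk.
copyLoop :: Handle -> Handle -> IO ()
copyLoop inh outh =
    allocaBytes bufsize $ \ptr ->
        let loop = do
                n <- hGetBuf inh ptr bufsize
                when (n > 0) $ do
                    hPutBuf outh ptr n
                    loop
        in loop

main :: IO ()
main = do
    file <- fmap Prelude.head getArgs
    withFile file ReadMode $ \h -> copyLoop h stdout
```

The buffer lives for the whole copy and is freed automatically when allocaBytes returns, so the loop itself should allocate essentially nothing.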
UPDATE: Adding the original code's performance on my laptop:
time ./Cat2 huge > /dev/null
./Cat2 huge > /dev/null  0.12s user 0.10s system 99% cpu 0.219 total
UPDATE 2: Adding some basic profiling results:
Original Code:
Cat2 +RTS -p -RTS huge

total time = 0.21 secs (211 ticks @ 1000 us, 1 processor)
total alloc = 6,954,068,112 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc
MAIN MAIN 100.0 100.0

individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 46 0 100.0 100.0 100.0 100.0
CAF GHC.IO.Handle.FD 86 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 80 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 79 0 0.0 0.0 0.0 0.0
CAF System.Posix.Internals 75 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 72 0 0.0 0.0 0.0 0.0
Buffer-API Code:
Cat +RTS -p -RTS huge

total time = 0.06 secs (61 ticks @ 1000 us, 1 processor)
total alloc = 3,487,712 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc
MAIN MAIN 100.0 98.9

individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 44 0 100.0 98.9 100.0 100.0
CAF GHC.IO.Handle.FD 85 0 0.0 1.0 0.0 1.0
CAF GHC.Conc.Signal 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 80 0 0.0 0.1 0.0 0.1
CAF GHC.IO.FD 79 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 71 0 0.0 0.0 0.0 0.0
Notice especially the big difference in allocation: roughly 7 GB for the original hGetSome version versus about 3.5 MB for the buffer-API version.
The original question made me think it was about finding a performance issue in the exact code provided. Since the comment "I'm hoping to go for a more idiomatic / 'high level' Haskell solution" contradicts that assumption, I'll give a reasonably performing, idiomatic Haskell solution instead.
The way I would expect any programmer familiar with Haskell to solve this problem is with lazy ByteStrings. This allows the programmer to simply specify the task of reading input and writing output, while letting the compiler worry about the buffering and looping constructs.
module Main where

import System.IO
import System.Environment
import Data.ByteString.Lazy as BS
import Control.Monad

main :: IO ()
main = do
    file <- fmap Prelude.head getArgs
    handle <- openFile file ReadMode
    buf <- BS.hGetContents handle
    hPut stdout buf
The result is both more readable and better performing than the code in the original question:
timing 'cat'
real 0m0.075s
user 0m0.000s
sys 0m0.074s
timing strict bytestring with GHC -O2
real 0m0.254s
user 0m0.126s
sys 0m0.127s
timing strict bytestring with GHC -O2 -fllvm
real 0m0.267s
user 0m0.132s
sys 0m0.134s
timing lazy bytestring with GHC -O2
real 0m0.091s
user 0m0.023s
sys 0m0.067s
timing lazy bytestring with GHC -O2 -fllvm
real 0m0.091s
user 0m0.021s
sys 0m0.069s
That is, the lazy bytestring solution is 21% slower than cat. Putting cat last for preferential caching behavior results in a 59 ms runtime, placing the Haskell solution at 51% slower.
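The lazy-bytestring version can be expressed even more compactly by letting readFile manage the handle as well. This is a sketch with the same expected behavior; the helper name catFile is mine:

```haskell
module Main where

import qualified Data.ByteString.Lazy as BSL
import System.Environment (getArgs)
import System.IO

-- Stream a file to a handle. BSL.readFile is lazy, so chunks are
-- read on demand as hPut writes them; the whole file is never
-- forced into memory at once.
catFile :: FilePath -> Handle -> IO ()
catFile file h = BSL.readFile file >>= BSL.hPut h

main :: IO ()
main = do
    file <- fmap Prelude.head getArgs
    catFile file stdout
```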
EDIT: Dons suggested that using memory-mapped IO would more accurately model cat's behavior. I'm not sure how accurate that statement is, but mmap almost always results in better performance, and this situation is certainly no exception:
timing memory mapped lazy bytestring with GHC -O2
real 0m0.008s
user 0m0.004s
sys 0m0.003s
Which was produced by:
module Main where

import System.IO (stdout)
import System.Environment
import System.IO.Posix.MMap.Lazy
import Data.ByteString.Lazy (hPut)
import Control.Monad

main :: IO ()
main = do
    file <- fmap Prelude.head getArgs
    buf <- unsafeMMapFile file
    hPut stdout buf