I have a set of binary records packed into a file, and I am reading them using Data.ByteString.Lazy and Data.Binary.Get. With my current implementation, an 8 MB file takes 6 seconds to parse.
    import qualified Data.ByteString.Lazy as BL
    import Data.Binary.Get

    data Trade = Trade { timestamp :: Int, price :: Int, qty :: Int } deriving (Show)

    getTrades = do
      empty <- isEmpty
      if empty
        then return []
        else do
          timestamp <- getWord32le
          price <- getWord32le
          qty <- getWord16le
          rest <- getTrades
          let trade = Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)
          return (trade : rest)

    main :: IO ()
    main = do
      input <- BL.readFile "trades.bin"
      let trades = runGet getTrades input
      print $ length trades

What can I do to make this faster?
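Refactoring it slightly into what is basically a strict left fold gives much better performance and lowers GC overhead considerably when parsing an 8388600-byte file. The original builds the whole list inside a single Get action, so no element is available until the entire input has been parsed; the version below runs the parser once per record and carries only a strict counter.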
    {-# LANGUAGE BangPatterns #-}
    module Main (main) where

    import qualified Data.ByteString.Lazy as BL
    import Data.Binary.Get

    data Trade = Trade
      { timestamp :: {-# UNPACK #-} !Int
      , price     :: {-# UNPACK #-} !Int
      , qty       :: {-# UNPACK #-} !Int
      } deriving (Show)

    getTrade :: Get Trade
    getTrade = do
      timestamp <- getWord32le
      price     <- getWord32le
      qty       <- getWord16le
      return $! Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)

    countTrades :: BL.ByteString -> Int
    countTrades input = stepper (0, input) where
      stepper (!count, !buffer)
        | BL.null buffer = count
        | otherwise =
            let (trade, rest, _) = runGetState getTrade buffer 0
            in stepper (count+1, rest)

    main :: IO ()
    main = do
      input <- BL.readFile "trades.bin"
      let trades = countTrades input
      print trades

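As far as I know, runGetState is marked deprecated in later releases of the binary package. The following is a minimal sketch of the same strict left fold written against runGetOrFail instead, assuming a binary version newer than the one bundled with GHC 7.4.1. The names foldTrades and the Either String result are my own choices for illustration and are not part of the code above; a truncated trailing record is reported as a Left rather than raising an exception.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord16le, getWord32le, runGetOrFail)

data Trade = Trade
  { timestamp :: {-# UNPACK #-} !Int
  , price     :: {-# UNPACK #-} !Int
  , qty       :: {-# UNPACK #-} !Int
  } deriving (Show)

-- One 10-byte record: two little-endian Word32 fields, then a Word16.
getTrade :: Get Trade
getTrade = do
  t <- getWord32le
  p <- getWord32le
  q <- getWord16le
  return $! Trade (fromIntegral t) (fromIntegral p) (fromIntegral q)

-- Hypothetical helper: strict left fold over all records in the input.
-- The parser is run once per record, and the accumulator is forced on
-- every step so no chain of thunks builds up.
foldTrades :: (a -> Trade -> a) -> a -> BL.ByteString -> Either String a
foldTrades step = go
  where
    go !acc buffer
      | BL.null buffer = Right acc
      | otherwise =
          case runGetOrFail getTrade buffer of
            Left (_, _, msg)       -> Left msg
            Right (rest, _, trade) -> go (step acc trade) rest

-- Counting is just a fold that ignores the record itself.
countTrades :: BL.ByteString -> Either String Int
countTrades = foldTrades (\n _ -> n + 1) 0
```

With this shape, other aggregates only need a different step function, for example foldTrades (\s t -> s + qty t) 0 to sum the quantities.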
And the related runtime stats. Even though the allocation numbers are close, the GC and max heap size are quite a bit different between revisions.
All examples here were built with GHC 7.4.1 -O2.
The original source, run with +RTS -K1G -RTS due to excessive stack space usage:
    426,003,680 bytes allocated in the heap
    443,141,672 bytes copied during GC
     99,305,920 bytes maximum residency (9 sample(s))
            203 MB total memory in use (0 MB lost due to fragmentation)

    Total time    0.62s  (  0.81s elapsed)
    %GC time      83.3%  (86.4% elapsed)

Daniel's revision:
    357,851,536 bytes allocated in the heap
    220,009,088 bytes copied during GC
     40,846,168 bytes maximum residency (8 sample(s))
             85 MB total memory in use (0 MB lost due to fragmentation)

    Total time    0.24s  (  0.28s elapsed)
    %GC time      69.1%  (71.4% elapsed)

And this post:
    290,725,952 bytes allocated in the heap
        109,592 bytes copied during GC
         78,704 bytes maximum residency (10 sample(s))
              2 MB total memory in use (0 MB lost due to fragmentation)

    Total time    0.06s  (  0.07s elapsed)
    %GC time       5.0%  (6.0% elapsed)
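
For anyone wanting to repeat the measurement: each record is 4 + 4 + 2 = 10 bytes, so an 8388600-byte file holds 838860 records. Below is a rough sketch of an encoder for a synthetic input of that shape using Data.Binary.Put. The field values are arbitrary placeholders that I made up, not the data used for the numbers above, and the names recordSize, putTrade and encodeTrades are hypothetical.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put (Put, putWord16le, putWord32le, runPut)
import Data.Word (Word32)

-- Size of one record in bytes: Word32 + Word32 + Word16.
recordSize :: Int
recordSize = 4 + 4 + 2

-- Write one synthetic record. The values are placeholders derived
-- from the record index, chosen only so the fields are not all zero.
putTrade :: Word32 -> Put
putTrade i = do
  putWord32le i                              -- timestamp
  putWord32le (i * 3)                        -- price
  putWord16le (fromIntegral (i `mod` 1000))  -- qty

-- Encode n records back to back, in the layout getTrade expects.
encodeTrades :: Int -> BL.ByteString
encodeTrades n = runPut (mapM_ (putTrade . fromIntegral) [0 .. n - 1])
```

Writing the file is then a single call such as BL.writeFile "trades.bin" (encodeTrades (8388600 `div` recordSize)).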