I am trying to find frequency of characters in file using Haskell. I want to be able to handle files ~500MB size.
What I've tried till now
It does the job but is a bit slow as it parses the file 256 times
calculateFrequency :: L.ByteString -> [(Word8, Int64)]
calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0]
I have also tried using Data.Map but the program runs out of memory (in ghc interpreter).
import qualified Data.ByteString.Lazy as L
import qualified Data.Map as M
calculateFrequency' :: L.ByteString -> [(Word8, Int64)]
calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs
Here's an implementation using mutable, unboxed vectors instead of higher level constructs. It also uses conduit
for reading the file to avoid lazy I/O.
import Control.Monad.IO.Class
import qualified Data.ByteString as S
import Data.Conduit
import Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Vector.Unboxed.Mutable as VM
import Data.Word (Word8)
type Freq = VM.IOVector Int
newFreq :: MonadIO m => m Freq
newFreq = liftIO $ VM.replicate 256 0
printFreq :: MonadIO m => Freq -> m ()
printFreq freq =
liftIO $ mapM_ go [0..255]
where
go i = do
x <- VM.read freq i
putStrLn $ show i ++ ": " ++ show x
addFreqWord8 :: MonadIO m => Freq -> Word8 -> m ()
addFreqWord8 f w = liftIO $ do
let index = fromIntegral w
oldCount <- VM.read f index
VM.write f index (oldCount + 1)
addFreqBS :: MonadIO m => Freq -> S.ByteString -> m ()
addFreqBS f bs =
loop (S.length bs - 1)
where
loop (-1) = return ()
loop i = do
addFreqWord8 f (S.index bs i)
loop (i - 1)
-- | The main entry point.
main :: IO ()
main = do
freq <- newFreq
runResourceT
$ sourceFile "random"
$$ CL.mapM_ (addFreqBS freq)
printFreq freq
I ran this on 500MB of random data and compared with @josejuan's UArray-based answer:
I think it should be possible to keep much of the elegance of josejuan's high-level approach yet keep the speed of the mutable vector implementation, but I haven't had a chance to try implementing something like that yet. Also, note that with some general purpose helper functions (like Data.ByteString.mapM or Data.Conduit.Binary.mapM) the implementation could be significantly simpler without affecting performance.
You can play with this implementation on FP Haskell Center as well.
EDIT: I added one of those missing functions to conduit
and cleaned up the code a bit; it now looks like the following:
import Control.Monad.Trans.Class (lift)
import Data.ByteString (ByteString)
import Data.Conduit (Consumer, ($$))
import qualified Data.Conduit.Binary as CB
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as VM
import System.IO (stdin)
freqSink :: Consumer ByteString IO (V.Vector Int)
freqSink = do
freq <- lift $ VM.replicate 256 0
CB.mapM_ $ \w -> do
let index = fromIntegral w
oldCount <- VM.read freq index
VM.write freq index (oldCount + 1)
lift $ V.freeze freq
main :: IO ()
main = (CB.sourceHandle stdin $$ freqSink) >>= print
The only difference in functionality is how the frequency is printed.
@Alex answer is good but, with only 256 values (indexes) an array should be better
import qualified Data.ByteString.Lazy as L
import qualified Data.Array.Unboxed as A
import qualified Data.ByteString as B
import Data.Int
import Data.Word
fq :: L.ByteString -> A.UArray Word8 Int64
fq = A.accumArray (+) 0 (0, 255) . map (\c -> (c, 1)) . concat . map B.unpack . L.toChunks
main = L.getContents >>= print . fq
@alex code take (for my sample file) 24.81 segs, using array take 7.77 segs.
UPDATED:
although Snoyman solution is better, an improvement avoiding unpack
maybe
fq :: L.ByteString -> A.UArray Word8 Int64
fq = A.accumArray (+) 0 (0, 255) . toCounterC . L.toChunks
where toCounterC [] = []
toCounterC (x:xs) = toCounter x (B.length x) xs
toCounter _ 0 xs = toCounterC xs
toCounter x i xs = (B.index x i', 1): toCounter x i' xs
where i' = i - 1
with ~50% speedup.
UPDATED:
Using IOVector
as Snoyman is as Conduit
version (a bit faster really, but this is a raw code, better use Conduit
)
import Data.Int
import Data.Word
import Control.Monad.IO.Class
import qualified Data.ByteString.Lazy as L
import qualified Data.Array.Unboxed as A
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as V
fq :: L.ByteString -> IO (V.IOVector Int64)
fq xs =
do
v <- V.replicate 256 0 :: IO (V.IOVector Int64)
g v $ L.toChunks xs
return v
where g v = toCounterC
where toCounterC [] = return ()
toCounterC (x:xs) = toCounter x (B.length x) xs
toCounter _ 0 xs = toCounterC xs
toCounter x i xs = do
let i' = i - 1
w = fromIntegral $ B.index x i'
c <- V.read v w
V.write v w (c + 1)
toCounter x i' xs
main = do
v <- L.getContents >>= fq
mapM_ (\i -> V.read v i >>= liftIO . putStr . (++", ") . show) [0..255]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With