Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Frequency of characters

I am trying to find frequency of characters in file using Haskell. I want to be able to handle files ~500MB size.

What I've tried till now

  1. It does the job but is a bit slow as it parses the file 256 times

    calculateFrequency :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0]
    
  2. I have also tried using Data.Map but the program runs out of memory (in ghc interpreter).

    import qualified Data.ByteString.Lazy as L
    import qualified Data.Map as M
    
    calculateFrequency' :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs
    
like image 836
Ravi Upadhyay Avatar asked Jan 15 '14 08:01

Ravi Upadhyay


2 Answers

Here's an implementation using mutable, unboxed vectors instead of higher level constructs. It also uses conduit for reading the file to avoid lazy I/O.

import           Control.Monad.IO.Class
import qualified Data.ByteString             as S
import           Data.Conduit
import           Data.Conduit.Binary         as CB
import qualified Data.Conduit.List           as CL
import qualified Data.Vector.Unboxed.Mutable as VM
import           Data.Word                   (Word8)

type Freq = VM.IOVector Int

newFreq :: MonadIO m => m Freq
newFreq = liftIO $ VM.replicate 256 0

printFreq :: MonadIO m => Freq -> m ()
printFreq freq =
    liftIO $ mapM_ go [0..255]
  where
    go i = do
        x <- VM.read freq i
        putStrLn $ show i ++ ": " ++ show x

addFreqWord8 :: MonadIO m => Freq -> Word8 -> m ()
addFreqWord8 f w = liftIO $ do
    let index = fromIntegral w
    oldCount <- VM.read f index
    VM.write f index (oldCount + 1)

addFreqBS :: MonadIO m => Freq -> S.ByteString -> m ()
addFreqBS f bs =
    loop (S.length bs - 1)
  where
    loop (-1) = return ()
    loop i = do
        addFreqWord8 f (S.index bs i)
        loop (i - 1)

-- | The main entry point.
main :: IO ()
main = do
    freq <- newFreq
    runResourceT
        $  sourceFile "random"
        $$ CL.mapM_ (addFreqBS freq)
    printFreq freq

I ran this on 500MB of random data and compared with @josejuan's UArray-based answer:

  • conduit based/mutable vectors: 1.006s
  • UArray: 17.962s

I think it should be possible to keep much of the elegance of josejuan's high-level approach yet keep the speed of the mutable vector implementation, but I haven't had a chance to try implementing something like that yet. Also, note that with some general purpose helper functions (like Data.ByteString.mapM or Data.Conduit.Binary.mapM) the implementation could be significantly simpler without affecting performance.

You can play with this implementation on FP Haskell Center as well.

EDIT: I added one of those missing functions to conduit and cleaned up the code a bit; it now looks like the following:

import           Control.Monad.Trans.Class   (lift)
import           Data.ByteString             (ByteString)
import           Data.Conduit                (Consumer, ($$))
import qualified Data.Conduit.Binary         as CB
import qualified Data.Vector.Unboxed         as V
import qualified Data.Vector.Unboxed.Mutable as VM
import           System.IO                   (stdin)

freqSink :: Consumer ByteString IO (V.Vector Int)
freqSink = do
    freq <- lift $ VM.replicate 256 0
    CB.mapM_ $ \w -> do
        let index = fromIntegral w
        oldCount <- VM.read freq index
        VM.write freq index (oldCount + 1)
    lift $ V.freeze freq

main :: IO ()
main = (CB.sourceHandle stdin $$ freqSink) >>= print

The only difference in functionality is how the frequency is printed.

like image 99
Michael Snoyman Avatar answered Sep 18 '22 23:09

Michael Snoyman


@Alex answer is good but, with only 256 values (indexes) an array should be better

import qualified Data.ByteString.Lazy as L
import qualified Data.Array.Unboxed as A
import qualified Data.ByteString as B
import Data.Int
import Data.Word

fq :: L.ByteString -> A.UArray Word8 Int64
fq = A.accumArray (+) 0 (0, 255) . map (\c -> (c, 1)) . concat . map B.unpack . L.toChunks

main = L.getContents >>= print . fq

@alex code take (for my sample file) 24.81 segs, using array take 7.77 segs.

UPDATED:

although Snoyman solution is better, an improvement avoiding unpack maybe

fq :: L.ByteString -> A.UArray Word8 Int64
fq = A.accumArray (+) 0 (0, 255) . toCounterC . L.toChunks
     where toCounterC [] = []
           toCounterC (x:xs) = toCounter x (B.length x) xs
           toCounter  _ 0 xs = toCounterC xs
           toCounter  x i xs = (B.index x i', 1): toCounter x i' xs
                               where i' = i - 1

with ~50% speedup.

UPDATED:

Using IOVector as Snoyman is as Conduit version (a bit faster really, but this is a raw code, better use Conduit)

import           Data.Int
import           Data.Word
import           Control.Monad.IO.Class
import qualified Data.ByteString.Lazy          as L
import qualified Data.Array.Unboxed            as A
import qualified Data.ByteString               as B
import qualified Data.Vector.Unboxed.Mutable   as V

fq :: L.ByteString -> IO (V.IOVector Int64)
fq xs =
     do
       v <- V.replicate 256 0 :: IO (V.IOVector Int64)
       g v $ L.toChunks xs
       return v
     where g v = toCounterC
                 where toCounterC [] = return ()
                       toCounterC (x:xs) = toCounter x (B.length x) xs
                       toCounter  _ 0 xs = toCounterC xs
                       toCounter  x i xs = do
                                             let i' = i - 1
                                                 w  = fromIntegral $ B.index x i'
                                             c <- V.read v w
                                             V.write v w (c + 1)
                                             toCounter x i' xs

main = do
          v <- L.getContents >>= fq
          mapM_ (\i -> V.read v i >>= liftIO . putStr . (++", ") . show) [0..255]
like image 39
josejuan Avatar answered Sep 20 '22 23:09

josejuan