Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is this simple text analysis program so slow?

Tags:

haskell

This is my code for counting lines and words:

import System.IO
import Data.List
main = do
        hSetBinaryMode stdin True
        interact $ (\(w,l)->"line:"++(show l)++"\nwords:"++(show w)++"\n")
                   . foldl' (\(w,l) r-> w `seq` l `seq` (w+length r ,succ l) ) (0,0)
                   . lines

This takes about 10 seconds to run on a file of about 100 megabytes. I compared it to similar programs in Lua (9s), awk (20s), and wc -l -c (0.6s).

Why is this code so slow? What could be the problem?

like image 914
vzex Avatar asked Apr 02 '12 14:04

vzex


1 Answers

I/O with String is known to be less than fast in Haskell. The bytes read from the handle in general have to be converted to Unicode code points, and then a linked list is built from those. That's a lot of work causing a lot of allocation. In this case, the conversion to code points is a bit simpler, since you set stdin to binary mode, but the construction of the linked list of characters still takes a lot of time.

Another small factor is that your line count is using Integer, but that's minor and only plays a significant role when the I/O is up to speed.

If you need fast I/O, you have to use a type better suited for that. One possibility is using ByteString, for example

import Data.List
import qualified Data.ByteString.Lazy.Char8 as C
main = do
        txt <- C.getContents
        putStrLn $ (\(w,l)->"line:"++(show l)++"\nwords:"++(show w)++"\n"). foldl' (\(w,l) r-> w `seq` l `seq` (w+C.length r ,succ l) ) (0,0) . C.lines $ txt

does the job on a 94MB file in 0.12s on my box (wc -l -c takes 0.06s), while the original using String took 4.4s. It can be optimised further,

{-# LANGUAGE BangPatterns #-}
import Data.List
import qualified Data.ByteString.Lazy.Char8 as C
main = do
        txt <- C.getContents
        putStrLn $ (\(w,l)->"line:"++(show l)++"\nwords:"++(show w)++"\n"). loop 0 0 . C.lines $ txt

loop :: Int -> Int -> [C.ByteString] -> (Int,Int)
loop !w !l (ln:lns) = loop (w + fromIntegral (C.length ln)) (l+1) lns
loop w l _ = (w,l)

takes only 0.08s, which is decent enough for me to stop optimising there (a similar change for the String version brings the time down to 3.6s for that).

like image 154
Daniel Fischer Avatar answered Oct 27 '22 00:10

Daniel Fischer