As a little exercise, I wrote the following word-counting program in Haskell. It counts the distinct words in a text file and outputs the 50 most frequent ones along with their frequencies.
import qualified Data.Map as Map
import Data.List.Split
import Data.List
import Data.Ord

-- Count words
count = Map.toList . foldl' increment Map.empty
  where
    increment dict k = Map.insert k (1 + Map.findWithDefault 0 k dict) dict

-- Sort the counts
countAndSort = sortBy (flip $ comparing snd) . count

-- Pretty printing
pp :: Show a => [(String,a)] -> IO ()
pp = putStrLn . foldl' format "" where
  format text (x,y) = text ++ "\n" ++ x ++ "\t" ++ show y

main = readFile "pg13951.txt" >>= pp . take 50 . countAndSort . splitOn " "
The problem is that it's 16 times slower than my Python implementation, which uses a mutable dict:
def increment(dic, word):
    dic[word] = dic.get(word, 0) + 1
    return dic

print sorted(reduce(increment, open("pg13951.txt").read().split(), {}).items(), key=lambda e: -e[1])[:50]
I think the problem is that GHC is constantly reallocating new maps when it could just reuse the same one over and over. The run-time statistics show a lot of allocations:
$ ghc -rtsopts count.hs
$ ./count +RTS -sstderr
de      7682
et      4423
la      4238
<snip>
d'Artagnan      511
M.      502
c'est   443
d'Artagnan,     443

     705,888,048 bytes allocated in the heap
     655,511,720 bytes copied during GC
     139,823,800 bytes maximum residency (10 sample(s))
       1,049,416 bytes maximum slop
             287 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1366 colls,     0 par    2.16s    2.26s     0.0017s    0.0072s
  Gen  1        10 colls,     0 par    2.86s    3.09s     0.3093s    1.5055s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    3.18s  (  3.36s elapsed)
  GC      time    5.02s  (  5.36s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    8.20s  (  8.72s elapsed)

  %GC     time      61.2%  (61.4% elapsed)

  Alloc rate    221,831,366 bytes per MUT second

  Productivity  38.8% of total user, 36.5% of total elapsed
My question is: is there a way to make this program perform better, without resorting to dirty tricks such as working in the IO monad, using mutable data structures, etc.?
PS: the data file is available at the following URL: http://www.gutenberg.org/cache/epub/13951/pg13951.txt
Here are some quick and simple optimizations that I tried.
Original version on my machine:
real 0m1.539s
user 0m1.452s
sys 0m0.076s
Instead of using insert and foldl', you can use fromListWith to count the words.
count = Map.toList . Map.fromListWith (+) . flip zip (repeat 1)
This is over twice as fast.
real 0m0.687s
user 0m0.648s
sys 0m0.032s
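The trick here is that flip zip (repeat 1) pairs every word with an initial count of 1, and fromListWith (+) sums the counts whenever it meets the same key twice. A quick GHCi illustration of the idea (my own example, not from the original post):

ghci> import qualified Data.Map as Map
ghci> let count = Map.toList . Map.fromListWith (+) . flip zip (repeat 1)
ghci> count (words "la de la et de de")
[("de",3),("et",1),("la",2)]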
The String type is a linked list of characters, which makes manipulating strings rather elegant but inefficient. We can use the Text type to get more efficient string handling. I also rewrote your pp function to use unlines instead of foldl', and used words instead of splitOn for the original split.
{-# LANGUAGE OverloadedStrings #-}

import Data.Monoid
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T

pp :: Show a => [(Text,a)] -> IO ()
pp = T.putStrLn . T.unlines . map format where
  format (x,y) = x <> "\t" <> (T.pack $ show y)

main = T.readFile "pg13951.txt" >>= pp . take 50 . countAndSort . T.words
Again, twice as fast as the previous step.
real 0m0.330s
user 0m0.316s
sys 0m0.008s
Use the strict version of Map
import qualified Data.Map.Strict as Map
That's about a 20% speed increase:
real 0m0.265s
user 0m0.252s
sys 0m0.008s
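The strict variant evaluates each value as it is stored, so the counts are kept as plain numbers instead of chains of unevaluated (+) thunks. Putting all three changes together, the complete program might look roughly like this; this is my own assembly of the steps above, so treat it as a sketch rather than exact final code:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Map.Strict as Map
import Data.List (sortBy)
import Data.Monoid ((<>))
import Data.Ord (comparing)
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T

-- Pair every word with 1 and let fromListWith (+) sum the duplicates.
count :: [Text] -> [(Text, Int)]
count = Map.toList . Map.fromListWith (+) . flip zip (repeat 1)

-- Sort in descending order of frequency.
countAndSort :: [Text] -> [(Text, Int)]
countAndSort = sortBy (flip $ comparing snd) . count

-- Render one "word<TAB>count" line per entry.
pp :: Show a => [(Text, a)] -> IO ()
pp = T.putStrLn . T.unlines . map format
  where format (x, y) = x <> "\t" <> T.pack (show y)

main :: IO ()
main = T.readFile "pg13951.txt" >>= pp . take 50 . countAndSort . T.words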