Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Building a histogram with haskell, many times slower than with python

Tags:

haskell

I was going to test naive bayes classification. One part of it was going to be building a histogram of the training data. The problem is, I am using a large training data, the haskell-cafe mailing list since a couple of years back, and there are over 20k files in the folder.

It takes a while over two minutes to create the histogram with python, and a little over 8 minutes with haskell. I'm using Data.Map (insertWith'), enumerators and text. What else can I do to speed up the program?

Haskell:

import qualified Data.Text as T
import qualified Data.Text.IO as TI
import System.Directory
import Control.Applicative
import Control.Monad (filterM, foldM)
import System.FilePath.Posix ((</>))
import qualified Data.Map as M
import Data.Map (Map)
import Data.List (foldl')
import Control.Exception.Base (bracket)
import System.IO (Handle, openFile, hClose, hSetEncoding, IOMode(ReadMode), latin1)
import qualified Data.Enumerator as E
import Data.Enumerator (($$), (>==>), (<==<), (==<<), (>>==), ($=), (=$))
import qualified Data.Enumerator.List as EL
import qualified Data.Enumerator.Text as ET



withFile' ::  (Handle -> IO c) -> FilePath -> IO c
withFile' f fp = do
  bracket
    (do
      h ← openFile fp ReadMode
      hSetEncoding h latin1
      return h)
    hClose
    (f)

buildClassHistogram c = do
  files ← filterM doesFileExist =<< map (c </> ) <$> getDirectoryContents c
  foldM fileHistogram M.empty files

fileHistogram m file = withFile' (λh → E.run_ $ enumHist h) file
  where
    enumHist h = ET.enumHandle h $$ EL.fold (λm' l → foldl' (λm'' w → M.insertWith' (const (+1)) w 1 m'') m' $ T.words l) m

Python:

for filename in listdir(root):
    filepath = root + "/" + filename
    # print(filepath)
    fp = open(filepath, "r", encoding="latin-1")
    for word in fp.read().split():
        if word in histogram:
            histogram[word] = histogram[word]+1
        else:
            histogram[word] = 1

Edit: Added imports

like image 653
Masse Avatar asked Mar 19 '12 14:03

Masse


2 Answers

You could try using imperative hash maps from the hashtables package: http://hackage.haskell.org/package/hashtables I remember I once got a moderate speedup compared to Data.Map. I wouldn't expect anything spectacular though.

UPDATE

I simplified your python code so I could test it on a single big file (100 million lines):

import sys
histogram={}
for word in sys.stdin.readlines():
    if word in histogram:
        histogram[word] = histogram[word]+1
    else:
        histogram[word] = 1
print histogram.get("the")

Takes 6.06 seconds

Haskell translation using hashtables:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as T
import  qualified Data.HashTable.IO as HT
main = do
  ls <- T.lines `fmap` T.getContents
  h <- HT.new :: IO (HT.BasicHashTable T.ByteString Int)
  flip mapM_ ls $ \w -> do
    r <- HT.lookup h w 
    case r of 
      Nothing -> HT.insert h w (1::Int)
      Just c  -> HT.insert h w (c+1)
  HT.lookup h "the" >>= print 

Run with a large allocation area: histogram +RTS -A500M Takes 9.3 seconds, with 2.4% GC. Still quite a bit slower than Python but not too bad.

According to the GHC user guide, you can change the RTS options while compiling:

GHC lets you change the default RTS options for a program at compile time, using the -with-rtsopts flag (Section 4.12.6, “Options affecting linking”). A common use for this is to give your program a default heap and/or stack size that is greater than the default. For example, to set -H128m -K64m, link with -with-rtsopts="-H128m -K64m".

like image 96
Grzegorz Chrupała Avatar answered Sep 18 '22 05:09

Grzegorz Chrupała


Your Haskell and Python implementations are using maps with different complexities. Python dictionaries are hash maps so the expected time for each operation (membership test, lookup, and insertion) is O(1). The Haskell version uses Data.Map which is a balanced binary search tree so the same operations take O(lg n) time. If you change your Haskell version to use a different map implementation, say a hash table or some sort of trie, it should get a lot quicker. However, I'm not familiar enough with the different modules implementing these data structures to say which is best. I'd start with the Data category on Hackage and look for one that you like. You might also look for a map that allows destructive updates like STArray does.

like image 27
Geoff Reedy Avatar answered Sep 21 '22 05:09

Geoff Reedy