Efficient hash map container in Haskell?

I want to count the unique blocks stored in a file using Haskell. A block is a run of 512 consecutive bytes, and the target file is at least 1 GB in size.

This is my initial attempt:

import           Control.Monad
import qualified Data.ByteString.Lazy as LB
import           Data.Foldable
import           Data.HashMap
import           Data.Int
import qualified Data.List            as DL
import           System.Environment

type DummyDedupe = Map LB.ByteString Int64

-- Split a lazy ByteString into consecutive n-byte blocks.
toBlocks :: Int64 -> LB.ByteString -> [LB.ByteString]
toBlocks n bs | LB.null bs = []
              | otherwise = let (block, rest) = LB.splitAt n bs
                            in block : toBlocks n rest

-- Count how many times each block occurs.
dedupeBlocks :: [LB.ByteString] -> DummyDedupe -> DummyDedupe
dedupeBlocks = flip $ DL.foldl' (\acc block -> insertWith (+) block 1 $! acc)

dedupeFile :: FilePath -> DummyDedupe -> IO DummyDedupe
dedupeFile fp dd = LB.readFile fp >>= return . (`dedupeBlocks` dd) . toBlocks 512

main :: IO ()
main = do
  dd <- getArgs >>= (`dedupeFile` empty) . head
  putStrLn . show . (*512) . size $ dd          -- bytes in unique blocks
  putStrLn . show . (*512) . foldl' (+) 0 $ dd  -- total bytes read

It works, but I was frustrated by its execution time and memory usage. Compared with a C++ version and even the Python implementation listed below, it was 3~5x slower and consumed 2~3x more memory.

import os
import os.path
import sys

def dedupeFile(dd, fp):
    fd = os.open(fp, os.O_RDONLY)
    # os.read returns bytes, so use b'' as the end-of-file sentinel
    for block in iter(lambda: os.read(fd, 512), b''):
        dd.setdefault(block, 0)
        dd[block] = dd[block] + 1
    os.close(fd)
    return dd

dd = {}
dedupeFile(dd, sys.argv[1])

print(len(dd) * 512)           # bytes in unique blocks
print(sum(dd.values()) * 512)  # total bytes read

I suspected the hash map implementation was the main cause and tried alternatives such as hashmap, hashtables, and unordered-containers, but none made a noticeable difference.
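For example, the unordered-containers variant I tried looked roughly like this, using strict ByteString keys (Data.HashMap.Strict's insertWith forces the combined value on every insert):

import qualified Data.ByteString     as SB
import qualified Data.HashMap.Strict as HM
import           Data.Int
import qualified Data.List           as DL

-- HM.insertWith in Data.HashMap.Strict forces the combined value,
-- so the Int64 counters never accumulate as (+) thunks.
dedupeStrict :: [SB.ByteString] -> HM.HashMap SB.ByteString Int64
dedupeStrict = DL.foldl' (\acc b -> HM.insertWith (+) b 1 acc) HM.empty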

Please help me to improve this program.

asked Feb 01 '13 by comatose


2 Answers

I don't think you will be able to beat the performance of Python dictionaries. They are implemented in C with years of optimizations put into them; hashmap, on the other hand, is new and not nearly as optimized. So a 3x gap is, in my opinion, good enough. You can optimize your Haskell code in certain places, but it still won't matter much. If you are still adamant about increasing performance, I think you should use a highly optimized C library through the FFI.
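To illustrate the mechanics only (not a full solution), binding a C function through the FFI looks like this; a real port would bind a C hash-table library the same way:

{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.String (CString, withCString)
import Foreign.C.Types  (CSize)

-- Bind a trivial libc function; a C hash table's functions
-- would be imported the same way.
foreign import ccall unsafe "string.h strlen"
    c_strlen :: CString -> IO CSize

main :: IO ()
main = withCString "hello" $ \s -> c_strlen s >>= print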

Here are some similar discussions:

haskell beginners

answered by Satvik


This may be completely irrelevant depending on your usage, but I am slightly worried about insertWith (+) block 1. If your counts reach high numbers, you will accumulate thunks in the cells of the hash map. It doesn't matter that you used ($!); that only forces the spine -- the values are likely still lazy.

Data.HashMap provides no strict insertWith' the way Data.Map does, but you can implement it:

-- Force the previous value (if any) alongside the updated map, so
-- per-key counter thunks never chain more than one (+) deep.
insertWith' :: (Hashable k, Ord k) => (a -> a -> a) -> k -> a
                                   -> Map k a -> Map k a
insertWith' f k v m = maybe id seq maybeval m'
    where
    (maybeval, m') = insertLookupWithKey (const f) k v m
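Dropping it into the question's code, dedupeBlocks would become (same imports as the question; foldl' already forces the accumulator, so the $! is no longer needed):

dedupeBlocks :: [LB.ByteString] -> DummyDedupe -> DummyDedupe
dedupeBlocks = flip $ DL.foldl' (\acc block -> insertWith' (+) block 1 acc)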

Also, you may want toBlocks to output (but not take as input) a list of strict ByteStrings, which will make hashing faster.
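A sketch of that change, reusing the question's toBlocks shape (LB.toStrict assumes bytestring >= 0.10; on older versions SB.concat . LB.toChunks does the same):

import qualified Data.ByteString      as SB
import qualified Data.ByteString.Lazy as LB
import           Data.Int

-- Copy each block into a single strict chunk so it hashes in one pass.
toBlocksStrict :: Int64 -> LB.ByteString -> [SB.ByteString]
toBlocksStrict n bs
    | LB.null bs = []
    | otherwise  = let (block, rest) = LB.splitAt n bs
                   in LB.toStrict block : toBlocksStrict n rest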

That's all I've got -- I'm no performance guru, though.

answered by luqui