I want to count the unique blocks stored in a file using Haskell. A block is just 512 consecutive bytes, and the target file is at least 1 GB in size.
This is my initial attempt:
import Control.Monad
import qualified Data.ByteString.Lazy as LB
import Data.Foldable
import Data.HashMap
import Data.Int
import qualified Data.List as DL
import System.Environment

type DummyDedupe = Map LB.ByteString Int64

toBlocks :: Int64 -> LB.ByteString -> [LB.ByteString]
toBlocks n bs | LB.null bs = []
              | otherwise = let (block, rest) = LB.splitAt n bs
                            in block : toBlocks n rest

dedupeBlocks :: [LB.ByteString] -> DummyDedupe -> DummyDedupe
dedupeBlocks = flip $ DL.foldl' (\acc block -> insertWith (+) block 1 $! acc)

dedupeFile :: FilePath -> DummyDedupe -> IO DummyDedupe
dedupeFile fp dd = LB.readFile fp >>= return . (`dedupeBlocks` dd) . toBlocks 512

main :: IO ()
main = do
    dd <- getArgs >>= (`dedupeFile` empty) . head
    putStrLn . show . (*512) . size $ dd
    putStrLn . show . (*512) . foldl' (+) 0 $ dd
It works, but I got frustrated with its execution time and memory usage. Especially compared with the C++ implementation and even the Python implementation listed below, it was 3~5x slower and consumed 2~3x more memory.
import os
import os.path
import sys

def dedupeFile(dd, fp):
    fd = os.open(fp, os.O_RDONLY)
    # Read 512-byte blocks until EOF; os.read returns bytes, so the
    # sentinel must be b'' rather than ''.
    for block in iter(lambda: os.read(fd, 512), b''):
        dd.setdefault(block, 0)
        dd[block] = dd[block] + 1
    os.close(fd)
    return dd

dd = {}
dedupeFile(dd, sys.argv[1])
print(len(dd) * 512)
print(sum(dd.values()) * 512)
I thought it was mainly due to the hash map implementation, so I tried alternatives such as hashmap, hashtables, and unordered-containers, but there wasn't any noticeable difference.
Please help me improve this program.
I don't think you will be able to beat the performance of Python dictionaries. They are implemented in C, with years of optimization behind them; hashmap, on the other hand, is new and nowhere near as optimized. So getting within 3x of their performance is, in my opinion, good enough. You can optimize your Haskell code in certain places, but it still won't matter much. If you are still adamant about increasing performance, I think you should use a highly optimized C library through the FFI, as sketched below.
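For concreteness, here is a minimal sketch of what such a binding could look like; my_hash.h and my_hash64 are placeholder names for whatever optimized C hashing routine you actually link against:

{-# LANGUAGE ForeignFunctionInterface #-}

import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as BU
import Data.Word (Word8, Word64)
import Foreign.C.Types (CSize (..))
import Foreign.Ptr (Ptr, castPtr)

-- Hypothetical C routine; replace with a real hashing library.
foreign import ccall unsafe "my_hash.h my_hash64"
    c_hash64 :: Ptr Word8 -> CSize -> IO Word64

-- Hash a strict ByteString in place, without copying it first.
hashBlock :: B.ByteString -> IO Word64
hashBlock bs =
    BU.unsafeUseAsCStringLen bs $ \(ptr, len) ->
        c_hash64 (castPtr ptr) (fromIntegral len)

Note that unsafeUseAsCStringLen passes a pointer into the ByteString's own buffer, so no per-block copy is made.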
Here is a similar discussion: haskell beginners.
This may be completely irrelevant depending on your usage, but I am slightly worried about insertWith (+) block 1. If your counts reach high numbers, you will accumulate thunks in the cells of the hash map. It doesn't matter that you used ($!) -- that only forces the spine; the values are likely still lazy.
Data.HashMap provides no strict version insertWith' like Data.Map does, but you can implement it:
-- Forces the value that was previously in the map (if any), so the
-- thunks built up by (+) never grow more than one level deep.
insertWith' :: (Hashable k, Ord k) => (a -> a -> a) -> k -> a
            -> Map k a -> Map k a
insertWith' f k v m = maybe id seq maybeval m'
  where
    (maybeval, m') = insertLookupWithKey (const f) k v m
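With that in place, the question's dedupeBlocks needs only a one-token change -- calling insertWith' instead of insertWith:

dedupeBlocks :: [LB.ByteString] -> DummyDedupe -> DummyDedupe
dedupeBlocks = flip $ DL.foldl' (\acc block -> insertWith' (+) block 1 $! acc)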
Also, you may want to output (but not input) a list of strict ByteStrings from toBlocks, which will make hashing faster.
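For instance, a minimal sketch of that variant (assuming bytestring >= 0.10 for LB.toStrict):

import qualified Data.ByteString as SB
import qualified Data.ByteString.Lazy as LB
import Data.Int (Int64)

-- Like toBlocks, but each block is copied into one contiguous strict
-- buffer, so hashing it no longer walks a chain of lazy chunks.
toBlocks' :: Int64 -> LB.ByteString -> [SB.ByteString]
toBlocks' n bs
    | LB.null bs = []
    | otherwise  = let (block, rest) = LB.splitAt n bs
                   in LB.toStrict block : toBlocks' n rest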
That's all I've got -- I'm no performance guru, though.