 

How to fill up a Data.Map in a space- and time-efficient way

Coming back to Haskell four years after my first glimpse of it, I'm as amazed as ever by the expressiveness, and as baffled as ever by my inability to predict space/time performance.

As a warm-up, I took to translating a tiny toy program I had written in C++. It's about "cheating" at Scrabble. You input your game, and it outputs possible words you may play, with your letters alone or by crossing a letter on the board.

The whole thing revolves around a dictionary that's preloaded at startup. The words are then stored as lists in a map, grouped with their anagrams; the keys are strings of sorted letters. An example will speak more clearly:

Key : "AEHPS"  Value : ["HEAPS","PHASE","SHAPE"]

The C++ version reads the ~320,000 words of the dictionary one at a time, in about 200 ms in total. The resulting data structure is a hash map stored in an array<vector<string>, 99991> and takes up about 12 megabytes of memory.

The Haskell version reads the same dictionary in about 5 seconds, and the program's heap size ballooned to 400 megabytes! I changed the value type in the Data.Map from [String] to [ByteString] to save some memory, and that brought the program's memory consumption down to roughly 290 megabytes. That's still 24 times more than my C++ version, and more than mere "overhead", even though Data.Map is a tree instead of an array.

So I assume that I'm doing something wrong.

The whole module is visible here : (deprecated link)

I suppose my trouble has something to do with the way Data.Map gets built incrementally, growing upon previous versions of itself? Or with the data structure itself? Or something else?

I'll try other solutions, like Data.HashMap, or filling up the Data.Map with fromListWith. Nevertheless, I'd like to build up some understanding of what's going on here. Thanks a lot for any insight!


Short answer:

Using Data.Map.Strict, forcing the evaluation of value elements, and storing the keys as ByteStrings too nearly divided the memory footprint by 3. The result is 100 MB, which is only twice as large as a standard std::multimap<std::string, std::string> holding the same dataset in C++. No speedup though. Git updated.

Thanks a lot to all who contributed; there is interesting material down here!

asked Jun 02 '15 by Mathias Dolidon


2 Answers

One mistake that you are making that hasn't been pointed out yet is that you are storing unevaluated thunks of the form B.pack word in the lists that are the values in your Map. That means that you are retaining essentially the entire input file in the inefficient String format during the construction of your Map, at a cost of 24 bytes per character in your input file. Using the Data.Map.Strict API makes no difference here, since the functions in that API only force the elements of the Map to weak head normal form, which for a list means only evaluating whether the outermost constructor is [] or (:), and not evaluating any of the list's elements.
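
To make the difference concrete, here is a contrast sketch (not from the original answer; the names are illustrative):

{-# LANGUAGE BangPatterns #-}

import Data.List (sort)
import qualified Data.ByteString.Char8 as B

-- Leaves a thunk in the list: B.pack w is not evaluated when the pair is
-- built, so the original String stays reachable behind the thunk.
lazyPair :: String -> (B.ByteString, [B.ByteString])
lazyPair w = (B.pack (sort w), [B.pack w])

-- Forces the ByteString before it is stored. Data.Map.Strict alone would
-- only force the list to WHNF and never reach this element.
strictPair :: String -> (B.ByteString, [B.ByteString])
strictPair w = let !b = B.pack w in (B.pack (sort w), [b])

main :: IO ()
main = mapM_ (print . strictPair) ["HEAPS", "PHASE", "SHAPE"]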

Another improvement you can make is to use the ShortByteString type available in recent versions of bytestring (the one that comes with GHC 7.8 is new enough). This is specifically designed for minimizing memory usage when storing many short bytestrings, with the trade-off being that most operations on a ShortByteString require a copy.

András Kovács's Map example code would look like this with these changes:

{-# LANGUAGE BangPatterns #-}

import Control.Applicative
import Data.List
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Short as B (ShortByteString, toShort, fromShort)

shortPack = B.toShort . B.pack -- String -> compact ShortByteString

main = do
  words <- lines <$> readFile "dict.txt"
  print $ M.size $
    -- the bang pattern forces each ShortByteString before it is stored,
    -- so no thunk retaining the original String ends up in the Map
    M.fromListWith (++) $ map (\w -> let !x = shortPack w in (shortPack $ sort w, [x])) words

Each of these changes saves about 30% of the maximum residency in my tests, for a total of over 50% space usage savings.

answered by Reid Barton


EDIT: Removed erroneous benchmarking and commentary; only a minor bit of advice remains. See Reid Barton's answer for a direct solution to the OP's question.

If you don't need to change the dictionary at runtime, then DAWGs (directed acyclic word graphs) are pretty much the most space- and time-efficient solution you can get (at least for word games).

For example, we can generate and serialize a DAWG from your dictionary that takes only 295 KB of space, and supports quite efficient lookup and prefix matching:

import Control.Applicative ((<$>)) -- for GHC < 7.10
import qualified Data.DAWG.Packed as D -- from my "packed-dawg" package

main = do
  words <- lines <$> readFile "dict.txt"
  D.toFile "dict.dawg" $ D.fromList words -- serialize as "dict.dawg"
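
Loading the serialized file back and querying it could then look like this (a sketch assuming packed-dawg's fromFile and member functions; check the package documentation for the exact API):

import qualified Data.DAWG.Packed as D

main :: IO ()
main = do
  dawg <- D.fromFile "dict.dawg" -- load the DAWG serialized above
  print $ D.member "PHASE" dawg  -- fast membership test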
answered by András Kovács