Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Memory efficient strings in Haskell

Tags:

The commonly recommended Haskell string types seem to be ByteString or Text. I often work with a very large number of short (English word sized) strings, and typically need to store them in a lookup table such as Data.Map. In many cases I find that in this scenario, a table of Strings can take up less memory then a table of ByteStrings. Unboxed Data.Vectors of Word8 are also (much) more compact than ByteStrings.

What is the best practice when one needs to store and compare large numbers of small strings in Haskell?

Below I have tried to condense a particular problematic case into a small example:

import qualified Data.ByteString.Lazy.Char8 as S import qualified Data.ByteString as Strict import qualified Data.Map as Map import qualified Data.Vector.Unboxed as U import qualified Data.Serialize as Serialize import Control.Monad.State  main =   putStr    . unlines . map show . flip evalState (0,Map.empty)    . mapM toInt    . S.words   =<<   S.getContents   toInt x = do     let x' =              U.fromList . Strict.unpack .  -- Comment this line to increase memory usage            Serialize.encode $ x     (i,t) <- get   case Map.lookup x' t of     Just j -> return j     Nothing -> do        let i' = i + (1::Int)       put (i', Map.insert x' i t)       return i 

When I run this on a file containing around 400.000 words of English text, the version with strict bytestring keys uses around 50MB memory, the one with Word8 vectors uses 6MB.

like image 239
Grzegorz Chrupała Avatar asked Feb 22 '12 16:02

Grzegorz Chrupała


1 Answers

In the absence of other answers, I'm going to go out on a limb here.

What is the best practice when one needs to store and compare large numbers of small strings in Haskell?

If the small strings are meant to be human readable (e.g. an English word) then use Text. If they are meant to be read only by the computer, use ByteString. The decision to use strict or lazy variants of these depends on how you build and use these small strings.

You shouldn't need to use your own unboxed Vectors of Word8. If you experiencing a specific situation where regular String is faster than Text or ByteString, then throw the details up on StackOverflow and we'll try to figure out why. If you perform detailed analysis and can prove that an unboxed Vector of Word8 consistently works significantly better than Text or ByteString, then start conversations on mailing lists, irc, reddit, etc; the standard libraries are not set in stone, and improvements are always welcome.

But I think it highly likely that you are just doing something weird, as hammar and shang suggest.

P.S. for your particular use case, instead of storing a lot of small strings, you should consider a more appropriate data structure catered to your needs, e.g. a Trie as danr suggests.

like image 168
Dan Burton Avatar answered Sep 23 '22 02:09

Dan Burton