The commonly recommended Haskell string types seem to be ByteString or Text. I often work with a very large number of short (English word sized) strings, and typically need to store them in a lookup table such as Data.Map. In many cases I find that in this scenario, a table of Strings can take up less memory then a table of ByteStrings. Unboxed Data.Vectors of Word8 are also (much) more compact than ByteStrings.
What is the best practice when one needs to store and compare large numbers of small strings in Haskell?
Below I have tried to condense a particular problematic case into a small example:
import qualified Data.ByteString.Lazy.Char8 as S import qualified Data.ByteString as Strict import qualified Data.Map as Map import qualified Data.Vector.Unboxed as U import qualified Data.Serialize as Serialize import Control.Monad.State main = putStr . unlines . map show . flip evalState (0,Map.empty) . mapM toInt . S.words =<< S.getContents toInt x = do let x' = U.fromList . Strict.unpack . -- Comment this line to increase memory usage Serialize.encode $ x (i,t) <- get case Map.lookup x' t of Just j -> return j Nothing -> do let i' = i + (1::Int) put (i', Map.insert x' i t) return i
When I run this on a file containing around 400.000 words of English text, the version with strict bytestring keys uses around 50MB memory, the one with Word8 vectors uses 6MB.
In the absence of other answers, I'm going to go out on a limb here.
What is the best practice when one needs to store and compare large numbers of small strings in Haskell?
If the small strings are meant to be human readable (e.g. an English word) then use Text
. If they are meant to be read only by the computer, use ByteString
. The decision to use strict or lazy variants of these depends on how you build and use these small strings.
You shouldn't need to use your own unboxed Vector
s of Word8
. If you experiencing a specific situation where regular String
is faster than Text
or ByteString
, then throw the details up on StackOverflow and we'll try to figure out why. If you perform detailed analysis and can prove that an unboxed Vector
of Word8
consistently works significantly better than Text
or ByteString
, then start conversations on mailing lists, irc, reddit, etc; the standard libraries are not set in stone, and improvements are always welcome.
But I think it highly likely that you are just doing something weird, as hammar and shang suggest.
P.S. for your particular use case, instead of storing a lot of small strings, you should consider a more appropriate data structure catered to your needs, e.g. a Trie as danr suggests.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With