
Haskell requires more memory than Python when I read a map from a file. Why?

Tags:

haskell

I have this simple code in Python:

input = open("baseforms.txt","r",encoding='utf8')
S = {}
for i in input:
    words = i.split()
    S.update( {j:words[0] for j in words} )
print(S.get("sometext","not found"))
print(len(S))

It needs about 300 MB to run; "baseforms.txt" is 123 MB. I've written the same code in Haskell:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Text.Lazy.Encoding(decodeUtf8)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I
import Control.Monad(liftM)

main = do
    text <- B.readFile "baseforms.txt"
    let m = (M.fromList . (concatMap (parseLine.decodeUtf8))) (B.lines text)
    print (M.lookup "sometext" m)
    print (M.size m)
    where
        parseLine line = let base:forms = T.words line in [(f,base)| f<-forms]

It requires 544 MB and is slower than the Python version. Why? Is it possible to optimise the Haskell version?

asked Mar 25 '15 by Alexander Razorenov

2 Answers

There is a lot happening in the Haskell version that's not happening in the Python version.

  • readFile uses lazy IO, which is a bit weird in general; I would avoid lazy IO here.

  • The file, as a bytestring, is broken into lines which are then decoded as UTF-8. This seems a little unnecessary, given the existence of Text IO functions.

  • The Haskell version is using a tree (Data.Map) whereas the Python version is using a hash table.

  • The strings are all lazy, which is probably not necessary if they're relatively short. Lazy strings have a couple of words of overhead per string, which can add up. You could fuse the lazy strings, or you could read the file all at once, or you could use something like conduit (a sketch combining a few of these suggestions follows this list).

  • GHC uses a copying collector, whereas the default Python implementation uses malloc() with reference counting and the occasional GC. This fact alone can account for large differences in memory usage, depending on your program.

  • Who knows how many thunks are getting created in the Haskell version.

  • It's unknown whether you've enabled optimizations.

  • It's unknown how much slower the Haskell version is.

  • We don't have your data file so we can't really test it ourselves.
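
Putting a few of these points together: below is a minimal sketch (untested against your file, so treat it as an illustration rather than the answer) that reads the whole file strictly with Data.Text.IO, skips the separate ByteString decode step, keeps the strings as strict Text, and swaps Data.Map for the strict HashMap from unordered-containers:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.HashMap.Strict as H   -- from unordered-containers
import qualified Data.Text as T
import qualified Data.Text.IO as I

main :: IO ()
main = do
    text <- I.readFile "baseforms.txt"       -- strict read, no lazy IO
    let m = H.fromList (concatMap parseLine (T.lines text))
    print (H.lookup "sometext" m)
    print (H.size m)
  where
    parseLine line = case T.words line of
        (base:forms) -> [(f, base) | f <- forms]
        []           -> []

Whichever variant you try, compile with -O2; the numbers don't mean much without optimizations.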

answered by Dietrich Epp

It's a bit late, but I studied this a little and think Dietrich Epp's account is right, though it can be simplified a little. Notice that there isn't any real Python programming going on in the Python file: it is orchestrating a very simple sequence of calls to C string operations and then to a C hash table implementation. (This is often a problem with really simple Python vs. Haskell benchmarks.) The Haskell, by contrast, is building an immense persistent Map, which is a fancy tree. So the main points of opposition here are C vs. Haskell, and hash-table-with-destructive-update vs. persistent map. Since there is little overlap in the input file, the tree you are constructing includes all the information in the input string, some of it repeated, and then rearranged with a pile of Haskell constructors. This is, I think, the source of the alarm you are experiencing, but it can be explained.

Compare these two files, one using ByteString:

import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
main = do m <- fmap proc (B.readFile "baseforms.txt")
          print (M.lookup (B.pack "sometext") m)
          print (M.size m)
proc = M.fromList  . concatMap (\(a:bs) -> map (flip (,) a) bs) 
       . map B.words . B.lines

and the other a Text-ified equivalent:

import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
import Data.Text.Encoding(decodeUtf8)
import qualified Data.Text as T

main = do
    m <- fmap proc (B.readFile "baseforms.txt")
    print (M.lookup (T.pack "sometext") m)
    print (M.size m)

proc = M.fromList  . concatMap (\(a:bs) ->  map (flip (,) a) bs)
       . map T.words . T.lines  . decodeUtf8

On my machine, the Python/C version takes just under 6 seconds, the ByteString version 8 seconds, and the Text version just over 10.

The ByteString implementation seems to use a bit more memory than the Python one, and the Text implementation distinctly more. The Text implementation takes more time because, of course, it adds a conversion to Text and then uses Text operations to break the string and Text comparisons to build the map.

Here is a go at analysing the memory phenomena in the Text case. First we have the ByteString in memory (130 MB). Once the Text is constructed (~250 MB, to judge unscientifically from what's going on in top), the ByteString is garbage collected while we construct the tree. In the end the Text tree (~380 MB, it looks like) uses more memory than the ByteString tree (~260 MB) because the Text fragments in the tree are bigger. The program as a whole uses more because the Text held in memory during the tree construction is itself bigger. To put it crudely: each bit of whitespace is being turned into a tree constructor and two Text constructors, together with the Text version of whatever the first 'word' of the line was and whatever the Text representation of the next word is. The weight of the constructors seems in either case to be about 130 MB, so at the last moment of the construction of the tree we are using something like 130 MB + 130 MB + 130 MB = 390 MB in the ByteString case, and 250 MB + 130 MB + 250 MB = 630 MB in the Text case.
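
If you want numbers rather than eyeballing top, one option (my addition, not something the analysis above relied on) is GHC's runtime statistics. The sketch below assumes a reasonably recent GHC where GHC.Stats.getRTSStats and its max_live_bytes field exist, and that the binary is compiled with -rtsopts and run with +RTS -T so the counters are actually collected:

import GHC.Stats (getRTSStats, max_live_bytes)
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
    m <- fmap proc (B.readFile "baseforms.txt")
    print (M.lookup (B.pack "sometext") m)
    print (M.size m)
    stats <- getRTSStats                          -- snapshot of RTS counters
    putStrLn ("max live bytes: " ++ show (max_live_bytes stats))

proc :: B.ByteString -> M.Map B.ByteString B.ByteString
proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
       . map B.words . B.lines

max_live_bytes reports the peak of live heap data measured after major GCs, which will usually come out noticeably smaller than the resident size top shows, since the copying collector keeps extra space around.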

answered by Michael