I have this simple code in Python:
input = open("baseforms.txt","r",encoding='utf8')
S = {}
for i in input:
words = i.split()
S.update( {j:words[0] for j in words} )
print(S.get("sometext","not found"))
print(len(S))
It requires 300MB for work. "baseforms.txt" size is 123M. I've wrote the same code in Haskell:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Text.Lazy.Encoding(decodeUtf8)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I
import Control.Monad(liftM)
main = do
text <- B.readFile "baseforms.txt"
let m = (M.fromList . (concatMap (parseLine.decodeUtf8))) (B.lines text)
print (M.lookup "sometext" m)
print (M.size m)
where
parseLine line = let base:forms = T.words line in [(f,base)| f<-forms]
It requires 544 MB and it's slower than Python version. Why? Is it possible to optimise Haskell version?
There is a lot happening in the Haskell version that's not happening in the Python version.
readFile
uses lazy IO, which is a bit weird in general. I would generally avoid lazy IO.
The file, as a bytestring, is broken into lines which are then decoded as UTF-8. This seems a little unnecessary, given the existence of Text
IO functions.
The Haskell version is using a tree (Data.Map
) whereas the Python version is using a hash table.
The strings are all lazy, which is probably not necessary if they're relatively short. Lazy strings have a couple words of overhead per string, which can add up. You could fuse the lazy strings, or you could read the file all at once, or you could use something like conduit.
GHC uses a copying collector, whereas the default Python implementation uses malloc()
with reference counting and the occasional GC. This fact alone can account for large differences in memory usage, depending on your program.
Who knows how many thunks are getting created in the Haskell version.
It's unknown whether you've enabled optimizations.
It's unknown how much slower the Haskell version is.
We don't have your data file so we can't really test it ourselves.
It's a bit late, but I studied this a little and think Dietrich Epp's account is right, but can be simplified a little. Notice that there doesn't seem to be any real python programming going on in the python file: it is orchestrating a very simple sequence of calls to C string operations and then to a C hash table implementation. (This is often a problem with really simple python v. Haskell benchmarks.) The Haskell, by contrast, is building an immense persistent Map
, which is a fancy tree. So the main points of opposition here are C vs Haskell, and hashtable-with-destructive-update vs persistent map. Since there is little overlap in the input file, the tree you are constructing includes all the information in the input string, some of it repeated, and then rearranged with a pile of Haskell constructors. This is I think the source of the alarm you are experiencing, but it can be explained.
Compare these two files, one using ByteString:
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
main = do m <- fmap proc (B.readFile "baseforms.txt")
print (M.lookup (B.pack "sometext") m)
print (M.size m)
proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
. map B.words . B.lines
and the other a Text-ified equivalent:
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
import Data.Text.Encoding(decodeUtf8)
import qualified Data.Text as T
main = do
m <- fmap proc (B.readFile "baseforms.txt")
print (M.lookup (T.pack "sometext") m)
print (M.size m)
proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
. map T.words . T.lines . decodeUtf8
On my machine, the python/C takes just under 6 seconds, the bytestring file takes 8 seconds, and the text file just over 10.
The bytestring implementation seems to use a bit more memory than the python, the text implementation distinctly more. The text implementation takes more time because, of course, it adds a conversion to text and then uses text operations to break the string and text comparisons to build the map.
Here is a go at analyzing the memory phenomena in the text case. First we have the bytestring in memory (130m). Once the text is constructed (~250m, to judge unscientifically from what's going on in top
), the bytestring is garbage collected while we construct the tree. In the end the text tree (~380m it looks like) uses more memory than the bytestring tree (~260m) because the text fragments in the tree are bigger. The program as a whole uses more because the text held in memory during the tree construction is itself bigger. To put it crudely: each bit of white-space is being turned into a tree constructor and two text constructors together with the text version of whatever the first 'word' of the line was and whatever the text representation next word is. The weight of the constructors seems in either case to be about 130m, so at the last moment of the construction of the tree we are using something like 130m + 130m + 130m = 390m in the bytestring case, and 250m + 130m + 250m = 630m in the text case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With