I'm trying to write a Haskell program to parse a huge text file (about 14 GB), but I can't figure out how to make it free unused data from memory, or how to avoid a stack overflow during foldr. Here is the program source:
import qualified Data.ByteString.Lazy.Char8 as LBS
import qualified Data.ByteString.Lex.Lazy.Double as BD
import System.Environment

data Vertex =
    Vertex {
        vertexX :: Double,
        vertexY :: Double,
        vertexZ :: Double}
    deriving (Eq, Show, Read)

data Extent =
    Extent {
        extentMax :: Vertex,
        extentMin :: Vertex}
    deriving (Eq, Show, Read)

addToExtent :: Extent -> Vertex -> Extent
addToExtent ext vert = Extent vertMax vertMin where
    (vertMin, vertMax) = (makeCmpVert max (extentMax ext) vert,
                          makeCmpVert min (extentMin ext) vert) where
        makeCmpVert f v1 v2 = Vertex (f (vertexX v1) (vertexX v2))
                                     (f (vertexY v1) (vertexY v2))
                                     (f (vertexZ v1) (vertexZ v2))

readCoord :: LBS.ByteString -> Double
readCoord l = case BD.readDouble l of
    Nothing -> 0
    Just (value, _) -> value

readCoords :: LBS.ByteString -> [Double]
readCoords l | LBS.length l == 0 = []
             | otherwise = let coordWords = LBS.split ' ' l
                           in map readCoord coordWords

parseLine :: LBS.ByteString -> Vertex
parseLine line = Vertex (head coords) (coords !! 1) (coords !! 2) where
    coords = readCoords line

processLines :: [LBS.ByteString] -> Extent -> Extent
processLines strs ext = foldr (\x y -> addToExtent y (parseLine x)) ext strs

processFile :: String -> IO ()
processFile name = do
    putStrLn name
    content <- LBS.readFile name
    let (countLine:recordsLines) = LBS.lines content
    case LBS.readInt countLine of
        Nothing -> putStrLn "Can't read records count"
        Just (recordsCount, _) -> do
            print recordsCount
            let vert = parseLine (head recordsLines)
            let ext = Extent vert vert
            print $ processLines recordsLines ext

main :: IO ()
main = do
    args <- getArgs
    case args of
        [] -> putStrLn "Missing file path"
        xs -> do
            processFile (head xs)
            return ()
The text file contains lines with three floating-point numbers delimited by space characters. This program always tries to occupy all free memory on the computer and crashes with an out-of-memory error.
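A sample input (values made up) looks like this, with the record count on the first line:

3
1.5 2.25 -0.5
0.0 3.125 7.75
-2.5 1.0 4.5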
You are being too lazy. Vertex and Extent have non-strict fields, and all your functions returning a Vertex return

Vertex thunk1 thunk2 thunk3

without forcing the components to be evaluated. Also, addToExtent directly returns an

Extent thunk1 thunk2

without evaluating the components. Thus none of the ByteStrings is actually released early to be garbage-collected, since the Doubles have not yet been parsed from them.
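For illustration, a minimal sketch of how parseLine could force the components itself, without touching the data declarations (seq here; bang patterns would work just as well):

parseLine :: LBS.ByteString -> Vertex
parseLine line = x `seq` y `seq` z `seq` Vertex x y z where
    x = head coords
    y = coords !! 1
    z = coords !! 2
    coords = readCoords line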
When that is fixed by making the fields of Vertex and Extent strict, or by having the functions returning a Vertex resp. an Extent force all parts of their input, you have the problem that

processLines strs ext = foldr (\x y -> addToExtent y (parseLine x)) ext strs

can't start assembling the result before the end of the list of lines is reached, because

(\x y -> addToExtent y (parseLine x))

is then strict in its second argument.
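To see why, expand the fold for a three-line file, writing f for the lambda:

foldr f ext [l1, l2, l3] = f l1 (f l2 (f l3 ext))

Since f is strict in its second argument, the outermost application can't return anything until the innermost one, at the end of the 14 GB list of lines, has been evaluated, so the whole chain of nested applications has to be built up first (which is also where the stack overflow comes from).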
However, barring NaNs and undefined values, if I didn't miss something, the result would be the same if you used a (strict!) left fold, so

processLines strs ext = foldl' (\x y -> addToExtent x (parseLine y)) ext strs

should produce the desired result without holding on to the data, provided Vertex and Extent get strict fields.
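Note that foldl' lives in Data.List, so this version additionally needs

import Data.List (foldl')

at the top of the module.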
Ah, I did miss something:

addToExtent ext vert = Extent vertMax vertMin
    where
        (vertMin, vertMax) = (makeCmpVert max (extentMax ext) vert,
                              makeCmpVert min (extentMin ext) vert)

If that isn't a typo (which I expect it is), fixing it would be somewhat difficult. I think it should be

(vertMax, vertMin) = ...
addToExtent is too lazy. A possible alternative definition is

addToExtent :: Extent -> Vertex -> Extent
addToExtent ext vert = vertMax `seq` vertMin `seq` Extent vertMax vertMin where
    (vertMax, vertMin) = (makeCmpVert max (extentMax ext) vert,
                          makeCmpVert min (extentMin ext) vert) where
        makeCmpVert f v1 v2 = Vertex (f (vertexX v1) (vertexX v2))
                                     (f (vertexY v1) (vertexY v2))
                                     (f (vertexZ v1) (vertexZ v2))
data Vertex =
    Vertex {
        vertexX :: {-# UNPACK #-} !Double,
        vertexY :: {-# UNPACK #-} !Double,
        vertexZ :: {-# UNPACK #-} !Double}
    deriving (Eq, Show, Read)
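The {-# UNPACK #-} pragmas tell GHC to store the three Doubles unboxed directly in the Vertex constructor instead of as pointers to separately allocated heap objects, which saves both space and indirections.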
The problem is that vertMin and vertMax are never evaluated until the entire file is processed, resulting in two huge thunks in Extent.
I also recommend changing the definition of Extent to

data Extent =
    Extent {
        extentMax :: !Vertex,
        extentMin :: !Vertex}
    deriving (Eq, Show, Read)
(though with these changes, the seq calls in addToExtent become redundant).
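For concreteness, a minimal sketch of what addToExtent can then shrink to, with the same makeCmpVert helper as above and the tuple replaced by two plain bindings in the intended order:

addToExtent :: Extent -> Vertex -> Extent
addToExtent ext vert = Extent vertMax vertMin where
    vertMax = makeCmpVert max (extentMax ext) vert
    vertMin = makeCmpVert min (extentMin ext) vert
    makeCmpVert f v1 v2 = Vertex (f (vertexX v1) (vertexX v2))
                                 (f (vertexY v1) (vertexY v2))
                                 (f (vertexZ v1) (vertexZ v2))

Because Extent now has strict fields, building Extent vertMax vertMin forces both vertices, and the strict Vertex fields in turn force the Doubles, so nothing keeps the original ByteString lines alive.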