 

Memory usage for lazy datatypes

I've written a program that analyzes and performs operations on data from a file. My first implementation uses Data.ByteString to read the contents of the file. The contents are then converted to a Vector of samples, using Data.Vector.Unboxed. I then perform the processing and analysis on this (unboxed) vector of sample values.

Just as an experiment, I wanted to know what would happen if I made use of Haskell's laziness. I decided to do a simple test by using Data.ByteString.Lazy instead of Data.ByteString, and Data.Vector instead of Data.Vector.Unboxed. I expected to see an improvement in memory usage. Even if my program eventually needs to know the value of every sample, I would still expect memory usage to rise incrementally. When I profiled my program, the results surprised me.

My original version finishes in about 20 ms and its memory usage looks like this: [heap profile image] This looks like lazy behavior to me. Samples seem to be loaded into memory as they are needed by my program.

Using Data.Vector and Data.ByteString gave the following result: [heap profile image] This looks like the opposite of lazy behavior. All the samples seem to be loaded into memory at once and then removed one by one.

I suspected this had something to do with my misunderstanding of boxed and unboxed types, so I tried using Data.ByteString.Lazy with Data.Vector.Unboxed. This was the result: [heap profile image] I have no idea how to explain what I'm seeing here.

Could anyone explain the results I'm getting?

EDIT: I'm reading from the file using hGet, which gives me a Data.ByteString.Lazy. I convert this ByteString to a Data.Vector of Floats via the following function:

toVector :: ByteString -> Vector Float
toVector bs =  U.generate (BS.length bs `div` 3) $ \i ->
     myToFloat [BS.index bs (3*i), BS.index bs (3*i+1), BS.index bs (3*i+2)]
  where
    myToFloat :: [Word8] -> Float
    myToFloat words = ...

Each float is represented in 3 bytes.
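The body of myToFloat is elided above, and the question does not specify the actual encoding. For illustration only, here is one hypothetical way a 3-byte sample conversion could look, assuming the bytes form a big-endian 24-bit signed integer scaled into [-1, 1] (a common layout for 24-bit audio samples):

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Int (Int32)
import Data.Word (Word8)

-- Hypothetical decoder: interpret the 3 bytes as a big-endian
-- 24-bit signed integer, sign-extend it, and scale into [-1, 1].
myToFloat :: [Word8] -> Float
myToFloat [b0, b1, b2] =
  let raw = (fromIntegral b0 `shiftL` 16)
        .|. (fromIntegral b1 `shiftL` 8)
        .|.  fromIntegral b2 :: Int32
      -- values with the top (24th) bit set are negative
      signed = if raw >= 0x800000 then raw - 0x1000000 else raw
  in fromIntegral signed / 8388608  -- 2^23
myToFloat _ = error "myToFloat: expected exactly 3 bytes"
```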

The rest of the processing mostly consists of applying higher-order functions (e.g. filter, map, etc.) to the data.
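With Data.Vector.Unboxed, a pipeline of that shape fuses into a single loop over the samples, which is one reason the unboxed version can stay memory-friendly. A minimal sketch (the pipeline itself is hypothetical, not the question's actual processing):

```haskell
import qualified Data.Vector.Unboxed as U

-- Hypothetical analysis pipeline: scale the samples, drop the
-- quiet ones, and compute the running peak. With unboxed vectors
-- these combinators fuse, so no intermediate vector is allocated.
analyze :: U.Vector Float -> Float
analyze = U.foldl' max 0 . U.filter (> 0.1) . U.map (* 0.5)
```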

EDIT 2: My parser contains a function that reads all the data from a file and returns it in a vector of samples (using the toVector function above). I've written two versions of this program, one with Data.ByteString and one with Data.ByteString.Lazy, and used them to perform a simple test:

main = do
  [file] <- getArgs
  samples <- getSamplesFromFile file
  let slice = V.slice 0 100000 samples
  let filtered = V.filter (>0) slice
  print filtered

The strict version gives me the following memory usage: [heap profile image] The lazy version gives me the following memory usage: [heap profile image] This result seems to be the complete opposite of what I would expect. Could someone explain this? What is wrong with Data.ByteString.Lazy?

asked Apr 24 '26 by Thomas Vanhelden

1 Answer

You are using length on a lazy bytestring. That demands the whole string. If that were the only use of the input lazy bytestring, garbage collection could make the program run in constant space. However, you access the string after that for further computation, forcing the whole data to persist in memory.

The solution would be to avoid length altogether and to fold over the lazy bytestring (just once!) so that streaming can do its job.

You can, for instance, do something like:

myread :: ByteString -> [Float]
myread bs = case BS.splitAt 3 bs of
   (chunk, end)
     | [x1, x2, x3] <- BS.unpack chunk -> myToFloat [x1, x2, x3] : myread end
     | otherwise -> []
     -- TODO handle shorter data as well

toVector :: ByteString -> Vector Float
toVector bs = U.fromList (myread bs)

There's probably a better way that leverages the Vector API directly; U.unfoldr looks promising.
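Following that suggestion, here is a sketch of an unfoldr-based version, assuming the question's myToFloat :: [Word8] -> Float (its real body isn't shown, so a trivial placeholder is used here):

```haskell
import qualified Data.ByteString.Lazy as BSL
import qualified Data.Vector.Unboxed as U
import Data.Word (Word8)

-- Stand-in for the question's elided converter; placeholder only.
myToFloat :: [Word8] -> Float
myToFloat = fromIntegral . sum

-- Consume the lazy bytestring in 3-byte steps; unfoldr stops at
-- the first chunk shorter than 3 bytes, so the input is traversed
-- exactly once and earlier chunks can be garbage-collected.
toVector :: BSL.ByteString -> U.Vector Float
toVector = U.unfoldr step
  where
    step bs = case BSL.splitAt 3 bs of
      (chunk, rest)
        | ws@[_, _, _] <- BSL.unpack chunk -> Just (myToFloat ws, rest)
        | otherwise -> Nothing  -- fewer than 3 bytes left: stop
```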

answered Apr 27 '26 by chi


