I've written a program that analyzes and performs operations on data from a file. My first implementation uses Data.ByteString to read the contents of the file. These contents are then converted to a Vector of samples, using Data.Vector.Unboxed. I then perform the processing and analysis on this (unboxed) vector of sample values.
Just as an experiment, I wanted to know what would happen if I made use of Haskell's laziness. I decided to run a simple test, using Data.ByteString.Lazy instead of Data.ByteString and Data.Vector instead of Data.Vector.Unboxed. I expected to see an improvement in memory usage: even if my program eventually needs the value of every sample, I would still expect memory usage to rise incrementally. When I profiled my program, the results surprised me.
My original version finishes in about 20ms and its memory usage looks like this:
This looks like lazy behavior to me. Samples seem to be loaded into memory as they are needed by my program.
Using Data.Vector and Data.ByteString gave the following result:
This looks like the opposite of lazy behavior. All the samples seem to be loaded into memory at once and then removed one by one.
I suspected this had something to do with my misunderstanding of Boxed and Unboxed types, so I tried using Data.ByteString.Lazy with Data.Vector.Unboxed. This was the result:
I have no idea how to explain what I'm seeing here.
Could anyone explain the results I'm getting?
EDIT
I'm reading from the file using hGet, which gives me a Data.ByteString.Lazy. I convert this ByteString to a Data.Vector of Floats via the following function:
    toVector :: ByteString -> Vector Float
    toVector bs = U.generate (BS.length bs `div` 3) $ \i ->
        myToFloat [BS.index bs (3*i), BS.index bs (3*i+1), BS.index bs (3*i+2)]
      where
        myToFloat :: [Word8] -> Float
        myToFloat words = ...
The floats are represented in 3 bytes.
The rest of the processing mostly consists of applying higher-order functions (e.g. filter, map, etc.) to the data.
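For reference, the elided myToFloat could look something like the sketch below. This is purely an assumption: the question never specifies the 3-byte encoding, so this hypothetical version treats the bytes as a 24-bit signed big-endian integer scaled to [-1, 1], as is common for 24-bit audio samples.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word8)

-- Hypothetical decoder (the real encoding is not given in the question):
-- interpret 3 bytes as a 24-bit signed big-endian integer, then scale
-- into [-1, 1]. Substitute the actual decoding logic here.
myToFloat :: [Word8] -> Float
myToFloat [b0, b1, b2] =
    let n :: Int
        n = (fromIntegral b0 `shiftL` 16)
            .|. (fromIntegral b1 `shiftL` 8)
            .|. fromIntegral b2
        -- sign-extend the 24-bit value
        signed = if n >= 0x800000 then n - 0x1000000 else n
    in fromIntegral signed / 8388608  -- 2 ^ 23
myToFloat _ = 0 / 0                   -- malformed input: NaN placeholder
```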
EDIT2
My parser contains a function that reads all the data from a file and returns this data in a vector of samples (using the previous toVector function). I've written two versions of this program, one with Data.ByteString and one with Data.ByteString.Lazy. I've used these two versions to perform a simple test:
    main = do
        [file] <- getArgs
        samples <- getSamplesFromFile file
        let slice    = V.slice 0 100000 samples
        let filtered = V.filter (>0) slice
        print filtered
The strict version gives me the following memory usage:
The lazy version gives me the following memory usage:
This result seems to be the complete opposite of what I expected. Could someone explain this? What is wrong with Data.ByteString.Lazy?
You are using length on a lazy bytestring. That will demand the whole string. If that was the only use of the input lazy bytestring, garbage collection could make it work in constant space. However, you access the string after that for further computation, forcing the whole data to persist in memory.
The solution is to avoid length altogether and instead fold over the lazy bytestring (just once!) so that streaming can do its job.
You can for instance do something like
    myread :: ByteString -> [Float]
    myread bs = case BS.splitAt 3 bs of
        (chunk, rest)
            | [x1, x2, x3] <- BS.unpack chunk -> myToFloat [x1, x2, x3] : myread rest
            | otherwise -> []  -- fewer than 3 bytes left: stop

    toVector bs = U.fromList (myread bs)
Probably there's a better way leveraging the Vector API; U.unfoldr looks promising.
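A sketch of the U.unfoldr route, assuming the same (unspecified) myToFloat decoder from the question: it consumes the lazy bytestring in a single left-to-right pass, never calling BS.length, so earlier chunks can be garbage-collected as the vector is built.

```haskell
import qualified Data.ByteString.Lazy as BS
import qualified Data.Vector.Unboxed as U
import Data.Word (Word8)

-- Build the vector directly with U.unfoldr: the step function peels
-- off 3 bytes at a time and stops when fewer than 3 remain.
-- myToFloat is the asker's 3-byte decoder, assumed to be in scope.
toVector :: BS.ByteString -> U.Vector Float
toVector = U.unfoldr step
  where
    step bs = case BS.splitAt 3 bs of
        (chunk, rest)
            | [x1, x2, x3] <- BS.unpack chunk
                -> Just (myToFloat [x1, x2, x3], rest)
            | otherwise -> Nothing  -- ignore a trailing partial sample
```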