I have the following snippet:
import qualified Data.Vector as V
import qualified Data.ByteString.Lazy as BL
import System.Environment
import Data.Word
import qualified Data.List.Stream as S
histogram :: [Word8] -> V.Vector Int
histogram c = V.accum (+) (V.replicate 256 0) $ S.zip (map fromIntegral c) (S.repeat 1)
mkHistogram file = do
    hist <- (histogram . BL.unpack) `fmap` BL.readFile file
    print hist
I see it like this: nothing is done until printing. When printing, the thunks are unwound by first unpacking, then mapping fromIntegral over one Word8 at a time. Each of these Word8s is zipped with 1, again one value at a time. These tuples are then consumed by the accumulator function, which updates the array one tuple/Word8 at a time. Then we move to the next thunk and repeat until no content is left.
This would allow for creating histograms in constant memory, but alas, this is not happening; instead it crashes with a stack overflow. If I profile it, I see it run to the end, but it uses a lot of memory (300-500 MB for a 2.5 MB file). Memory grows linearly until the very end, when it is finally released, forming a "nice" triangular graph.
Where did my reasoning go wrong and what steps should I take to make this run in constant memory?
I believe the problem is that Data.Vector is not strict in its elements. So although your reasoning is right, while the histogram is being accumulated the vector's elements look like:
<1+(1+(1+0)) (1+(1+0)) 0 0 (1+(1+(1+(1+0)))) ... >
Rather than
<3 2 0 0 4 ...>
And only when you print are those sums computed. I don't see a strict accum function in the docs (shame), and there isn't any place to hook in a seq. One way out of this predicament may be to use Data.Vector.Unboxed instead, since unboxed types are unlifted, aka strict. Maybe you could request a strict accum function with your example as a use case.
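As a rough sketch of the Data.Vector.Unboxed suggestion (not profiled against your file, so treat the constant-memory behaviour as something to verify), the same histogram could look like the following. The VU alias is my own choice, and the list comprehension stands in for S.zip/S.repeat only so the example does not depend on the stream-fusion package:

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector.Unboxed as VU
import Data.Word (Word8)

-- Same shape as the original histogram, but backed by an unboxed vector.
-- An unboxed Int slot cannot hold a thunk, so each (+) is evaluated as it
-- is written instead of piling up as 1+(1+(1+0)).
histogram :: [Word8] -> VU.Vector Int
histogram bytes =
    VU.accum (+) (VU.replicate 256 0) [(fromIntegral b, 1) | b <- bytes]

-- Reads the file lazily and prints the 256 bucket counts.
mkHistogram :: FilePath -> IO ()
mkHistogram file = do
    hist <- (histogram . BL.unpack) `fmap` BL.readFile file
    print hist
```

Only the vector type changes here; the lazy BL.readFile and BL.unpack pipeline from the question stays as it was.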