I have a list of key-value pairs and I want to count how many times each key occurs and what values it occurs with, but when I try, I get a stack overflow. Here's a simplified version of the code I'm running:
import Array
add (n, vals) val = n `seq` vals `seq` (n+1,val:vals)
histo = accumArray add (0,[]) (0,9) [(0, n) | n <- [0..5000000]]
main = print histo
When I compile this with 'ghc -O' and run it, I get "Stack space overflow: current size 8388608 bytes."
I think I know what's going on: accumArray has the same properties as foldl, and so I need a strict version of accumArray. Unfortunately, the only one I've found is in Data.Array.Unboxed, which doesn't work for an array of lists.
The documentation says that when the accumulating function is strict, then accumArray should be too, but I can't get this to work, and the discussion here claims that the documentation is wrong (at least for GHC).
Is there a strict version of accumArray other than the one in Data.Array.Unboxed? Or is there a better way to do what I want?
Well, strict doesn't necessarily mean that no thunks are created, it just means that if an argument is bottom, the result is bottom too. But accumArray
is not that strict, it just writes bottoms to the array if they occur. It can't really do anything else, since it must allow for non-strict functions that could produce defined values from intermediate bottoms. And the strictness analyser can't rewrite it so that the accumulation function is evaluated to WHNF on each write if it is strict, because that would change the semantics of the programme in a rather drastic way (an array containing some bottoms vs. bottom).
That said, I agree that there's an unfortunate lack of strict and eager functions in several areas.
For your problem, you can use a larger stack (+RTS -K128M
didn't suffice here, but 256M did), or you can use
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import GHC.Arr
strictAccumArray :: Ix i => (e -> a -> e) -> e -> (i,i) -> [(i,a)] -> Array i e
strictAccumArray fun ini (l,u) ies = case iar of
Array _ _ m barr -> Array l u m barr
where
iar = runSTArray $ do
let n = safeRangeSize (l,u)
stuff = [(safeIndex (l,u) n i, e) | (i, e) <- ies]
arr <- newArray (0,n-1) ini
let go ((i,v):ivs) = do
old <- unsafeRead arr i
unsafeWrite arr i $! fun old v
go ivs
go [] = return arr
go stuff
With a strict write, the thunks are kept small, so there's no stack overflow. But beware, the lists take a lot of space, so if your list is too long, you may get a heap exhaustion.
Another option would be to use a Data.Map
(or Data.IntMap
, if the version of containers is 0.4.1.0 or later) instead of an array, since that comes with insertWith'
, which forces the result of the combining function on use. The code could for example be
import qualified Data.Map as M -- or Data.IntMap
import Data.List (foldl')
histo :: M.Map Int (Int,[Int]) -- M.IntMap (Int,[Int])
histo = foldl' upd M.empty [(0,n) | n <- [0 .. 15000000]]
where
upd mp (i,n) = M.insertWith' add i (1,[n]) mp
add (j,val:_) (k,vals) = k `seq` vals `seq` (k+j,val:vals)
add _ pr = pr -- to avoid non-exhaustive pattern warning
Disadvantages of using a Map
are
a -> a -> a
, so it needs to be a bit more complicated in your case.Map
s and IntMap
s have some book-keeping overhead, so that will use more space than an array. But if the list of updates is large compared to the number of indices, the difference will be negligible (the overhead is k
words per key, independent of the size of the values) in this case, where the size of the values grows with each update.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With