My problem is that when using any of the Map implementations in Haskell that I always get a "Stack space overflow" when working with a million values.
What I'm trying to do is process a list of pairs. Each pair contains two Ints (not Integers, I failed miserably with them so I tried Ints instead). I want to go through each pair in the list and use the first Int as a key. For each unique key I want to build up a list of second elements where each of the second elements are in a pair that have the same first element. So what I want at the end is a "Map" from an Int to a list of Ints. Here's an example.
Given a list of pairs like this:
[(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)]
I would like to end up with a "Map" like this:
{1 : [42,79,10], 2 : [11], 3 : [18,99]}
(I'm using a Python-like notation above to illustrate a "Map". I know it ain't Haskell. It's just there for illustrative purposes.)
So the first thing I tried was my own hand built version where I sorted the list of pairs of Ints and then went through the list building up a new list of pairs but this time the second element was a list. The first element is the key i.e. the unique Int values of the first element of each pair and the second element is a list of the second values of each original pair which have the key as the first element.
So given a list of pairs like this:
[(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)]
I end up with a list of pairs like this:
[(1, [42,79,10], (2, [11]), (3, [18,99])]
This is easy to do. But there is one problem. The performance of the "sort" function on the original list (of 10 million pairs) is shockingly bad. I can generate the original list of pairs in less than a second. I can process the sorted list into my hand built map in less than a second. However, sorting the original list of pairs takes 40 seconds.
So I thought about using one of the built-in "Map" data structures in Haskell to do the job. The idea being I build my original list of pairs and then using standard Map functions to build a standard Map.
And that's where it all went pear-shaped. It works well on a list of 100,000 values but when I move to 1 million values, I get a "Stack space overflow" error.
So here's some example code that suffers from the problem. Please, please note that is not the actual code that I want to implement. It is just a very simplified version of code for which the same problem exists. I don't really want to separate a million consecutive numbers into odd and even partitions!!
import Data.IntMap.Strict(IntMap, empty, findWithDefault, insert, size)
power = 6
ns :: [Int]
ns = [1..10^power-1]
mod2 n = if odd n then 1 else 0
mod2Pairs = zip (map mod2 ns) ns
-- Takes a list of pairs and returns a Map where the key is the unique Int values
-- of the first element of each pair and the value is a list of the second values
-- of each pair which have the key as the first element.
-- e.g. makeMap [(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)] =
-- 1 -> [42,79,10], 2 -> [11], 3 -> [18,99]
makeMap :: [(Int,a)] -> IntMap [a]
makeMap pairs = makeMap' empty pairs
where
makeMap' m [] = m
makeMap' m ((a, b):cs) = makeMap' m' cs
where
bs = findWithDefault [] a m
m' = insert a (b:bs) m
mod2Map = makeMap mod2Pairs
main = do
print $ "Yowzah"
print $ "length mod2Pairs="++ (show $ length mod2Pairs)
print $ "size mod2Map=" ++ (show $ size mod2Map)
When I run this, I get:
"Yowzah"
"length mod2Pairs=999999"
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
From the above output, it should be clear that the stack space overflow happens when I try to do "makeMap mod2Pairs".
But to my naive eye all this seems to do is go through a list of pairs and for each pair lookup a key (the first element of each pair) and A) if it doesn't find a match return an empty list or B) if it does find a match, return the list that has previously been inserted. In either case it "cons"'s the second element of the pair to the "found" list and inserts that back into the Map with the same key.
(PS instead of findWithDefault, I've also tried lookup and handled the Just and Nothing using case but to no avail.)
I've had a look through the Haskell documentation on the various Map implementations and from the point of view of performance in terms of CPU and memory (especially stack memory), it seems that A) a strict implementation and B) one where the keys are Ints would be the best. I have also tried Data.Map and Data.Strict.Map and they also suffer from the same problem.
I am convinced the problem is with the "Map" implementation. Am I right? Why would I get a stack overflow error i.e. what is the Map implementation doing in the background that is causing a stack overflow? Is it making lots and lots of recursive calls behind the scenes?
Can anyone help explain what is going on and how to get around the problem?
I don't have an old enough GHC to check (this works just fine for me, and I don't have 7.6.3 as you do), but my guess would be that your makeMap'
is too lazy. Probably this will fix it:
makeMap' m ((a, b):cs) = m `seq` makeMap' m' cs
Without it, you are building up a million-deep nested thunk, and deeply-nested thunks is the traditional way to cause stack overflows in Haskell.
Alternately, I would try just replacing the whole makeMap
implementation with fromListWith
:
makeMap pairs = fromListWith (++) [(k, [v]) | (k, v) <- pairs]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With