I am dealing with the computation which has as an intermediate result a list A=[B], which is a list of K lists of the length L. The time-complexity to compute an element of B is controlled by the parameter M and is theoretically linear in M. Theoretically I would expect the time-complexity for the computation of A to be O(K*L*M). However, this is not the case and I don't understand why?
Here is the simple complete sketch program which exhibits the problem I have explained
import System.Random (randoms, mkStdGen)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)
import Data.List (transpose)
type Point = (Double, Double)
fmod :: Double -> Double -> Double
fmod a b | a < 0 = b - fmod (abs a) b
| otherwise = if a < b then a
else let q = a / b in b * (q - fromIntegral (floor q))
standardMap :: Double -> Point -> Point
standardMap k (q, p) = (fmod (q + p) (2 * pi), fmod (p + k * sin(q)) (2 * pi))
trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map initial = initial : (trajectory map $ map initial)
justEvery :: Int -> [a] -> [a]
justEvery n (x:xs) = x : (justEvery n $ drop (n-1) xs)
justEvery _ [] = []
subTrace :: Int -> Int -> [a] -> [a]
subTrace n m = take (n + 1) . justEvery m
ensemble :: Int -> [Point]
ensemble n = let qs = randoms (mkStdGen 42)
ps = randoms (mkStdGen 21)
in take n $ zip qs ps
ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a) ->
Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m =
parMap rdeepseq ((map observable . subTrace n m) . orbitGen)
main = let k = 100
l = 100
m = 100
orbitGen = trajectory (standardMap 7)
observable (p, q) = p^2 - q^2
initials = ensemble k
mean xs = (sum xs) / (fromIntegral $ length xs)
result = (map mean)
$ transpose
$ ensembleTrace orbitGen observable l m
$ initials
in mapM_ print result
I am compiling with
$ ghc -O2 stdmap.hs -threaded
and runing with
$ ./stdmap +RTS -N4 > /dev/null
on the intel Q6600, Linux 3.6.3-1-ARCH, with GHC 7.6.1 and get the following results for the different sets of the parameters K, L, M (k, l, m in the code of the program)
(K=200,L=200,N=200) -> real 0m0.774s
user 0m2.856s
sys 0m0.147s
(K=2000,L=200,M=200) -> real 0m7.409s
user 0m28.102s
sys 0m1.080s
(K=200,L=2000,M=200) -> real 0m7.326s
user 0m27.932s
sys 0m1.020s
(K=200,L=200,M=2000) -> real 0m10.581s
user 0m38.564s
sys 0m3.376s
(K=20000,L=200,M=200) -> real 4m22.156s
user 7m30.007s
sys 0m40.321s
(K=200,L=20000,M=200) -> real 1m16.222s
user 4m45.891s
sys 0m15.812s
(K=200,L=200,M=20000) -> real 8m15.060s
user 23m10.909s
sys 9m24.450s
I don't quite understand where the problem of such a pure scaling might be. If I understand correctly the lists are lazy and should not be constructed, since they are consumed in the head-tail direction? As could be observed from the measurements there is a correlation between the excessive real-time consumption and the excessive system-time consumption as the excess would be on the system account. But if there is some memory management wasting time, this should still scale linearly in K, L, M.
Help!
EDIT
I made changes in the code according to the suggestions given by Daniel Fisher, which indeed solved the bad scaling with respect to M. As pointed out, by forcing the strict evaluation in the trajectory, we avoid the construction of large thunks. I understand the performance improvement behind that, but I still don't understand the bad scaling of the original code, because (if I understand correctly) the space-time-complexity of the construction of the thunk should be linear in M?
Additionally, I still have problems understanding the bad scaling with respect to K (the size of the ensemble). I performed two additional measurements with the improved code for K=8000 and K=16000, keeping L=200, M=200. Scaling up to K=8000 is as expected but for K=16000 it is already abnormal. The problem seems to be in the number of overflowed
SPARKS
, which is 0 for K=8000 and 7802 for K=16000. This probably reflects in the bad concurrency which I quantify as a quotient Q = (MUT cpu time) / (MUT real time)
which would be ideally equal to the number of CPU-s. However, Q ~ 4 for K = 8000 and Q ~ 2 for K = 16000.
Please help me understand the origin of this problem and the possible solutions.
K = 8000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
56,905,405,184 bytes allocated in the heap
503,501,680 bytes copied during GC
53,781,168 bytes maximum residency (15 sample(s))
6,289,112 bytes maximum slop
151 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 27893 colls, 27893 par 7.85s 1.99s 0.0001s 0.0089s
Gen 1 15 colls, 14 par 1.20s 0.30s 0.0202s 0.0558s
Parallel GC work balance: 23.49% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 8000 (8000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 95.90s ( 24.28s elapsed)
GC time 9.04s ( 2.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 104.95s ( 26.58s elapsed)
Alloc rate 593,366,811 bytes per MUT second
Productivity 91.4% of total user, 360.9% of total elapsed
gc_alloc_block_sync: 315819
and
K = 16000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
113,809,786,848 bytes allocated in the heap
1,156,991,152 bytes copied during GC
114,778,896 bytes maximum residency (18 sample(s))
11,124,592 bytes maximum slop
300 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 135521 colls, 135521 par 22.83s 6.59s 0.0000s 0.0190s
Gen 1 18 colls, 17 par 2.72s 0.73s 0.0405s 0.1692s
Parallel GC work balance: 18.05% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 16000 (8198 converted, 7802 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 221.77s (139.78s elapsed)
GC time 25.56s ( 7.32s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 247.34s (147.10s elapsed)
Alloc rate 513,176,874 bytes per MUT second
Productivity 89.7% of total user, 150.8% of total elapsed
gc_alloc_block_sync: 814824
M. A. D.'s point about fmod
is a good one, but it is not necessary to call out to C, and we can do better staying in Haskell land (the ticket the linked thread was about is meanwhile fixed). The trouble in
fmod :: Double -> Double -> Double
fmod a b | a < 0 = b - fmod (abs a) b
| otherwise = if a < b then a
else let q = a / b in b * (q - fromIntegral (floor q))
is that type defaulting leads to floor :: Double -> Integer
(and consequently fromIntegral :: Integer -> Double
) being called. Now, Integer
is a comparatively complicated type, with slow operations, and the conversion from Integer
to Double
is also relatively complicated. The original code (with parameters k = l = 200
and m = 5000
) produced the stats
./nstdmap +RTS -s -N2 > /dev/null
60,601,075,392 bytes allocated in the heap
36,832,004,184 bytes copied during GC
2,435,272 bytes maximum residency (13741 sample(s))
887,768 bytes maximum slop
9 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46734 colls, 46734 par 41.66s 20.87s 0.0004s 0.0058s
Gen 1 13741 colls, 13740 par 23.18s 11.62s 0.0008s 0.0041s
Parallel GC work balance: 60.58% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 34.99s ( 17.60s elapsed)
GC time 64.85s ( 32.49s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 99.84s ( 50.08s elapsed)
Alloc rate 1,732,048,869 bytes per MUT second
Productivity 35.0% of total user, 69.9% of total elapsed
on my machine (-N2
because I have only two physical cores). Simply changing the code to use a type signature floor q :: Int
brings that down to
./nstdmap +RTS -s -N2 > /dev/null
52,105,495,488 bytes allocated in the heap
29,957,007,208 bytes copied during GC
2,440,568 bytes maximum residency (10481 sample(s))
893,224 bytes maximum slop
8 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 36979 colls, 36979 par 32.96s 16.51s 0.0004s 0.0066s
Gen 1 10481 colls, 10480 par 16.65s 8.34s 0.0008s 0.0018s
Parallel GC work balance: 68.64% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.01s ( 0.01s elapsed)
MUT time 29.78s ( 14.94s elapsed)
GC time 49.61s ( 24.85s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 79.40s ( 39.80s elapsed)
Alloc rate 1,749,864,775 bytes per MUT second
Productivity 37.5% of total user, 74.8% of total elapsed
a reduction of about 20% in elapsed time, 13% in MUT time. Not bad. If we look at the code for floor
that you get with optimisations, we can see why:
floorDoubleInt :: Double -> Int
floorDoubleInt (D# x) =
case double2Int# x of
n | x <## int2Double# n -> I# (n -# 1#)
| otherwise -> I# n
floorDoubleInteger :: Double -> Integer
floorDoubleInteger (D# x) =
case decodeDoubleInteger x of
(# m, e #)
| e <# 0# ->
case negateInt# e of
s | s ># 52# -> if m < 0 then (-1) else 0
| otherwise ->
case TO64 m of
n -> FROM64 (n `uncheckedIShiftRA64#` s)
| otherwise -> shiftLInteger m e
floor :: Double -> Int
just uses the machine conversion, while floor :: Double -> Integer
needs an expensive decodeDoubleInteger
and more branches. But where floor
is called here, we know that all involved Double
s are nonnegative, hence floor
is the same as truncate
, which maps directly to the machine conversion double2Int#
, so let's try that instead of floor
:
INIT time 0.00s ( 0.00s elapsed)
MUT time 29.29s ( 14.70s elapsed)
GC time 49.17s ( 24.62s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 78.45s ( 39.32s elapsed)
a really small reduction (to be expected, the fmod
isn't really the bottleneck). For comparison, calling out to C:
INIT time 0.01s ( 0.01s elapsed)
MUT time 31.46s ( 15.78s elapsed)
GC time 54.05s ( 27.06s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 85.52s ( 42.85s elapsed)
is a bit slower (unsurprisingly, you can execute a number of primops in the time calling out to C takes).
But that's not where the big fish swim. The bad thing is that picking only every m
-th element of the trajectories leads to large thunks that cause a lot of allocation and take long to evaluate when the time comes. So let's eliminate that leak and make the trajectories strict:
{-# LANGUAGE BangPatterns #-}
trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map !initial@(!a,!b) = initial : (trajectory map $ map initial)
That reduces the allocations and GC time drastically, and as a consequence also the MUT time:
INIT time 0.00s ( 0.00s elapsed)
MUT time 21.83s ( 10.95s elapsed)
GC time 0.72s ( 0.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 22.55s ( 11.31s elapsed)
with the original fmod
,
INIT time 0.00s ( 0.00s elapsed)
MUT time 18.26s ( 9.18s elapsed)
GC time 0.58s ( 0.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.84s ( 9.47s elapsed)
with floor q :: Int
, and within measuring accuracy the same times for truncate q :: Int
(the allocation figures are a bit lower for truncate
).
The problem seems to be in the number of overflowed SPARKS, which is 0 for K=8000 and 7802 for K=16000. This probably reflects in the bad concurrency
Yes (though as far as I know the more correct term here would be parallelism instead of concurrency), there is a spark pool, and when that's full, any further sparks are not scheduled for being evaluated in whatever thread next has time when its turn comes, the computation is then done without parallelism, from the parent thread. In this case that means after an initial parallel phase, the computation falls back to sequential.
The size of the spark pool is apparently about 8K (2^13).
If you watch the CPU load via top, you will see that it drops from (close to 100%)*(number of cores)
to a much lower value after a while (for me, it was ~100% with -N2
and ~130% with -N4
).
The cure is to avoid sparking too much, and letting each spark do some more work. With the quick-and-dirty modification
ensembleTrace orbitGen observable n m =
withStrategy (parListChunk 25 rdeepseq) . map ((map observable . subTrace n m) . orbitGen)
I'm back to 200% with -N2
for practically the entire run and a good productivity,
INIT time 0.00s ( 0.00s elapsed)
MUT time 57.42s ( 29.02s elapsed)
GC time 5.34s ( 2.69s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 62.76s ( 31.71s elapsed)
Alloc rate 1,982,155,167 bytes per MUT second
Productivity 91.5% of total user, 181.1% of total elapsed
and with -N4
it's also fine (even a wee bit faster on the wall-clock - not much because all threads do basically the same, and I have only 2 physical cores),
INIT time 0.00s ( 0.00s elapsed)
MUT time 99.17s ( 26.31s elapsed)
GC time 16.18s ( 4.80s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 115.36s ( 31.12s elapsed)
Alloc rate 1,147,619,609 bytes per MUT second
Productivity 86.0% of total user, 318.7% of total elapsed
since now the spark pool doesn't overflow.
The proper fix is to make the size of the chunks a parameter that is computed from the number of trajectories and available cores so that the number of sparks doesn't exceed the pool size.
After doing some quick profiling I found that these are the serial offenders:
ghc --make -O2 MainNonOpt.hs -threaded -prof -auto-all -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null
>>>
COST CENTRE MODULE %time %alloc
fmod Main 46.3 33.3
standardMap Main 28.5 0.0
trajectory Main 23.8 66.6
What's surprising about fmod is the large number of allocations it does considering it is mostly a numerical function. So the next step would be to annotate fmod to see where is the problem:
fmod :: Double -> Double -> Double
fmod a b | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b
| otherwise = {-# SCC "posbranch" #-} if a < b then a
else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - {-# SCC "floor" #-} fromIntegral (floor q))
This gives us:
ghc --make -O2 MainNonOpt.hs -threaded -prof -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null
COST CENTRE MODULE %time %alloc
MAIN MAIN 61.5 70.0
posbranch Main 16.6 0.0
floor Main 14.9 30.0
expression Main 4.5 0.0
negbranch Main 1.9 0.0
So the bit with floor
is the one which causes the issues. After looking around it turns out that the Prelude does not implement some Double RealFrac functions as best as it should(see here), probably causing some boxing/unboxing.
So by following the advice from the link I used a modified version of floor which also made the call to fromIntegral unnecessary:
floor' :: Double -> Double
floor' x = c_floor x
{-# INLINE floor' #-}
foreign import ccall unsafe "math.h floor"
c_floor :: Double -> Double
fmod :: Double -> Double -> Double
fmod a b | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b
| otherwise = {-# SCC "posbranch" #-} if a < b then a
else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - ({-# SCC "floor" #-} floor' q))
EDIT: As Daniel Fisher Points out, there is no need to inline C code to improve the performance. An analogous Haskell function already exists. I'll leave the answer anyway, for further reference.
This does make a difference. On my machine, for k=l=200, M=5000 here are the number for the non-optimized and the optimized version:
Non optimized:
real 0m20.635s
user 1m17.321s
sys 0m4.980s
Optimized:
real 0m14.858s
user 0m55.271s
sys 0m3.815s
The trajectory
function may have similar problems and you can use profiling like it was used above to pin-point the issue.
A great starting point for profiling in Haskell can be found in this chapter of Real World Haskell.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With