I have a project where I'm building a Decision Tree in Haskell. The generated trees will have multiple branches that are independent of each other, so I figured they could be constructed in parallel.
The DecisionTree data type is defined like so:
data DecisionTree =
  Question Filter DecisionTree DecisionTree |
  Answer DecisionTreeResult

instance NFData DecisionTree where
  rnf (Answer dtr) = rnf dtr
  rnf (Question fil dt1 dt2) = rnf fil `seq` rnf dt1 `seq` rnf dt2
Here's the part of the algorithm that constructs the tree:

constructTree :: TrainingParameters -> [Map String Value] -> Filter -> Either String DecisionTree
constructTree trainingParameters trainingData fil =
  if informationGain trainingData (parseFilter fil) < entropyLimit trainingParameters
    then constructAnswer (targetVariable trainingParameters) trainingData
    else
      Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree
  where affirmativeTree = trainModel trainingParameters passedTData
        negativeTree    = trainModel trainingParameters failedTData
        passedTData     = filter (parseFilter fil) trainingData
        failedTData     = filter (not . parseFilter fil) trainingData
parEvalTree :: Strategy DecisionTree
parEvalTree (Question f dt1 dt2) = do
  dt1' <- rparWith rdeepseq dt1
  dt2' <- rparWith rdeepseq dt2
  return $ Question f dt1' dt2'
parEvalTree ans = return ans
trainModel recursively calls constructTree.
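For illustration only, a minimal sketch of that mutual recursion (the types follow from how constructTree uses it; selectBestFilter is an invented placeholder, the real trainModel is more involved):

-- Simplified sketch: only the fact that trainModel calls back into
-- constructTree matters for the parallelism question.
trainModel :: TrainingParameters -> [Map String Value] -> Either String DecisionTree
trainModel trainingParameters trainingData =
  selectBestFilter trainingParameters trainingData   -- hypothetical helper
    >>= constructTree trainingParameters trainingData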
The relevant line for parallelism is
Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree
I'm building this with the GHC flags -threaded -O2 -rtsopts -eventlog
and running it with
stack exec -- performance-test +RTS -A200M -N -s -l
(I'm on a 2-core machine).
But it doesn't seem to run anything in parallel:

  SPARKS: 164 (60 converted, 0 overflowed, 0 dud, 0 GC'd, 104 fizzled)

  INIT    time    0.000s  (  0.009s elapsed)
  MUT     time   29.041s  ( 29.249s elapsed)
  GC      time    0.048s  (  0.015s elapsed)
  EXIT    time    0.001s  (  0.006s elapsed)
  Total   time   29.091s  ( 29.279s elapsed)
I suspect there might be some issue with recursive calls with rdeepseq and the Strategy for parallelism. If some experienced Haskeller would chime in, it would really make my day :)
I am not an expert at Haskell performance/parallelism, but I think a couple of things are going on here.
Firstly, there is indeed this line:
Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree
Presumably, one might expect that the first part of this line builds up a data structure that looks like this:
                 +-------+
                 | Right |
                 +-------+
                     |
                +----------+
                | Question |
                +----------+
                 |   |    |
     +-----------+   |    +--------------+
     |               |                   |
  +-----+  +-------------------+  +----------------+
  | fil |  |       THUNK       |  |     THUNK      |
  +-----+  | (affirmativeTree) |  | (negativeTree) |
           +-------------------+  +----------------+
The evalTraversable will then see the Right and run parEvalTree on the Question, resulting in both thunks being sparked for deep evaluation in parallel.
Unfortunately, this isn't quite what happens, and I think the issue is due to the extra Either String layer. In order to evaluate the Question line (even just to WHNF), as evalTraversable must, we have to figure out whether the result is going to be a Right decisionTree or a Left _. This means that affirmativeTree and negativeTree have to be evaluated to WHNF before parEvalTree can ever come into play. Unfortunately, due to the structure of your code, evaluating either tree to WHNF in this way forces pretty much everything: the filter selection has to be forced in order to see which branch the recursive constructTree call takes, and then its own recursive calls to trainModel are forced to WHNF in the same way.
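To make that cascade concrete, here is a stripped-down model of it (the type T and the function build are invented for illustration; they are not part of your project):

data T = Leaf | Node T T

-- Evaluating `build n` to WHNF must decide between Right and Left, which
-- forces both recursive calls to WHNF, which in turn forces *their*
-- recursive calls, and so on: the entire tree is constructed before any
-- Strategy gets a chance to spark anything.
build :: Int -> Either String T
build 0 = Right Leaf
build n = Node <$> build (n - 1) <*> build (n - 1)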
This can be avoided by sparking off affirmativeTree and negativeTree separately first, and then only looking at the results in WHNF form after they've had time to be fully computed, by doing something like this:
uncurry (Question fil) <$> bisequence ((affirmativeTree, negativeTree) `using` parTuple2 rdeepseq rdeepseq)
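For reference, a sketch of that same combination step pulled out into a helper (the name buildQuestion is invented; bisequence comes from Data.Bitraversable, and parTuple2/rdeepseq/using from Control.Parallel.Strategies):

import Control.Parallel.Strategies (parTuple2, rdeepseq, using)
import Data.Bitraversable (bisequence)

-- Both Either results are sparked for deep evaluation *before* either one
-- is inspected, so the two subtrees can be built on separate cores;
-- bisequence then combines the two Eithers into a single Either of a pair.
buildQuestion :: Filter
              -> Either String DecisionTree
              -> Either String DecisionTree
              -> Either String DecisionTree
buildQuestion fil affirmativeTree negativeTree =
  uncurry (Question fil)
    <$> bisequence
          ((affirmativeTree, negativeTree) `using` parTuple2 rdeepseq rdeepseq)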
If you run your code with this line replacing the original and load it into ThreadScope, you will see that there is clearly some increase in parallelism: the activity graph briefly goes above 1 in a few places, and execution jumps between HECs in several places. Unfortunately, the vast majority of the program's time is still spent in sequential execution.
I tried to look into this a little bit, and I think that something in your tree construction code may be a bit right-biased. I added some traceMarkers and traceEvents, and it looks like there is frequently a fairly large imbalance between the positive and negative sides of a filter, which makes parallel execution not work very well: the positive subtree tends to finish very quickly, while the negative subtree takes a long time, creating what looks like essentially sequential execution. In several cases, the positive subtree is so small that the core that sparked the computation finishes it and then begins the negative subtree before another core can wake up to steal the work. This is where the long runs on a single core in ThreadScope come from.

The short period of time with a fair bit of parallelism that you can see at the beginning of the graph is the time during which the negative subtree of the first filter is executing, since that's the main filter with a negative subtree large enough to really contribute to parallelism. There are also a few similar (but much smaller) events later in the trace where reasonably-sized negative trees are created.
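The instrumentation I mean is along these lines (a hypothetical sketch, not the exact code I ran; traceEvent lives in Debug.Trace, and traceMarker works the same way):

import Debug.Trace (traceEvent)

-- Wrapping the two subtree results like this makes their evaluation show
-- up as named events in the eventlog, which ThreadScope displays alongside
-- the spark/HEC activity.  The helper name is invented for illustration.
labelSubtrees :: Either String DecisionTree
              -> Either String DecisionTree
              -> (Either String DecisionTree, Either String DecisionTree)
labelSubtrees pos neg =
  ( traceEvent "evaluating positive subtree" pos
  , traceEvent "evaluating negative subtree" neg )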
I would expect that if you make the change above and try to find filters that more evenly partition the dataset, you should see a fairly large increase in the parallelizability of this code.
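A cheap way to check how balanced a candidate filter is (a hypothetical helper, not part of your project) is to log the sizes of the two partitions it produces:

import Data.List (partition)

-- Returns (number of rows that pass the filter, number that fail).
-- The closer the two counts, the more evenly the subtrees split the work.
splitBalance :: (a -> Bool) -> [a] -> (Int, Int)
splitBalance p rows = (length passed, length failed)
  where (passed, failed) = partition p rows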