 

Strategies for constructing tree in parallel in Haskell

I have a project where I'm building a Decision Tree in Haskell. The generated trees will have multiple branches that are independent of each other, so I figured they could be constructed in parallel.

The DecisionTree data type is defined like so:

data DecisionTree =
    Question Filter DecisionTree DecisionTree |    
    Answer DecisionTreeResult

instance NFData DecisionTree where
    rnf (Answer dtr)            = rnf dtr
    rnf (Question fil dt1 dt2)  = rnf fil `seq` rnf dt1 `seq` rnf dt2

Here's the part of the algorithm that constructs the tree:

constructTree :: TrainingParameters -> [Map String Value] -> Filter -> Either String DecisionTree    
constructTree trainingParameters trainingData fil =    
    if informationGain trainingData (parseFilter fil) < entropyLimit trainingParameters    
    then constructAnswer (targetVariable trainingParameters) trainingData    
    else
        Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree    
        where   affirmativeTree   = trainModel trainingParameters passedTData    
                negativeTree      = trainModel trainingParameters failedTData    
                passedTData       = filter (parseFilter fil) trainingData    
                failedTData       = filter (not . parseFilter fil) trainingData

parEvalTree :: Strategy DecisionTree    
parEvalTree (Question f dt1 dt2) = do    
    dt1' <- rparWith rdeepseq dt1    
    dt2' <- rparWith rdeepseq dt2    
    return $ Question f dt1' dt2'
parEvalTree ans = return ans

trainModel recursively calls constructTree. The relevant line for parallelism is:

Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree 

I'm building this with the GHC flags -threaded -O2 -rtsopts -eventlog and running it with stack exec -- performance-test +RTS -A200M -N -s -l (I'm on a 2 core machine).

But it doesn't seem to run anything in parallel:

SPARKS: 164 (60 converted, 0 overflowed, 0 dud, 0 GC'd, 104 fizzled)

INIT    time    0.000s  (  0.009s elapsed)
MUT     time   29.041s  ( 29.249s elapsed)
GC      time    0.048s  (  0.015s elapsed)
EXIT    time    0.001s  (  0.006s elapsed)
Total   time   29.091s  ( 29.279s elapsed)

[ThreadScope output image]

I suspect there might be some issue with recursive calls with rdeepseq and the Strategy for parallelism. If some experienced Haskeller would chime in it would really make my day :)

The Hoff asked Nov 07 '22


1 Answer

I am not an expert at Haskell performance/parallelism, but I think a couple of things are going on here.

Firstly, there is indeed this line:

Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree 

One might expect the first part of this line to build up a data structure that looks like

                      +-------+
                      | Right |
                      +-------+
                          |
                    +----------+
                    | Question |
                    +----------+
                     |   |    |
   +-----------------+   |    +-----------+
   |                +----+                |
   |                |                     |
+-----+   +-------------------+   +----------------+
| fil |   |       THUNK       |   |     THUNK      |
+-----+   | (affirmativeTree) |   | (negativeTree) |
          +-------------------+   +----------------+

evalTraversable will then see the Right and run parEvalTree on the Question, sparking both thunks for deep evaluation in parallel.

Unfortunately, this isn't quite what happens, and I think the issue is the extra Either String layer. To evaluate the Question line even just to WHNF, as evalTraversable must, we have to know whether the result is going to be a Right decisionTree or a Left _. This means that affirmativeTree and negativeTree both have to be evaluated to WHNF before parEvalTree can ever come into play. And due to the structure of your code, evaluating either tree to WHNF forces pretty much everything: the filter selection has to be forced in order to see which branch the recursive constructTree call takes, and then its own recursive calls to trainModel are forced to WHNF in the same way.
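A minimal sketch confirming this, using error as a stand-in for an expensive subtree computation (bomb and combined are hypothetical names, not from the original code): reducing the Applicative combination only to WHNF still forces both arguments, because Either's (<*>) must pattern-match on them before it can produce a Right.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Stand-in for an expensive subtree computation; forcing it "explodes".
bomb :: Either String Int
bomb = error "subtree was forced"

-- Combined through Either's Applicative, just like
--   Question fil <$> affirmativeTree <*> negativeTree
combined :: Either String Int
combined = (+) <$> Right 1 <*> bomb

main :: IO ()
main = do
  -- evaluate reduces 'combined' only to WHNF, yet the error still fires:
  -- (<*>) for Either inspects both arguments before building a Right,
  -- so both "subtrees" are demanded before any Strategy ever runs.
  r <- try (evaluate combined) :: IO (Either SomeException (Either String Int))
  putStrLn $ case r of
    Left _  -> "WHNF forced the second argument"
    Right _ -> "second argument was not forced"
```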

This can be avoided by sparking off affirmativeTree and negativeTree separately first, and then only looking at the results in WHNF form after they've had time to be fully computed, by doing something like this:

uncurry (Question fil) <$> bisequence ((affirmativeTree, negativeTree) `using` parTuple2 rdeepseq rdeepseq)
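As a self-contained illustration of the combinators involved (affirmative and negative below are toy stand-ins for the real Either-wrapped subtree computations): parTuple2 rdeepseq rdeepseq sparks both components of the pair for deep evaluation first, and only afterwards does bisequence (from Data.Bitraversable) inspect the Right/Left constructors, so the WHNF check no longer serialises the two branches.

```haskell
import Control.Parallel.Strategies (parTuple2, rdeepseq, using)
import Data.Bitraversable (bisequence)

-- Toy stand-ins for the two Either-wrapped subtree computations.
affirmative, negative :: Either String Int
affirmative = Right (sum [1 .. 1000 :: Int])     -- 500500
negative    = Right (product [1 .. 10 :: Int])   -- 3628800

-- Spark both pair components in parallel (each forced with rdeepseq),
-- then sequence the Eithers out of the tuple.
combined :: Either String (Int, Int)
combined = bisequence ((affirmative, negative)
                           `using` parTuple2 rdeepseq rdeepseq)

main :: IO ()
main = print (uncurry (+) <$> combined)  -- Right 4129300
```

Compile with -threaded and run with +RTS -N to actually see the sparks converted.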

If you run your code with this line replacing the original and load it into ThreadScope, you will see that there is clearly some increase in parallelism: the activity graph briefly goes above 1 in a few places, and execution jumps between HECs in several places. Unfortunately, the vast majority of the program's time is still spent in sequential execution.

I tried to look into this a little bit, and I think that something in your tree construction may be rather right-biased. I added some traceMarkers and traceEvents, and it looks like there is frequently a large imbalance between the positive and negative sides of a filter, which makes parallel execution work poorly: the positive subtree tends to finish very quickly, while the negative subtree takes a long time, producing what is essentially sequential execution. In several cases the positive subtree is so small that the core that sparked the computation finishes it and then begins the negative subtree before another core can wake up to steal the work; this is where the long runs on a single core in ThreadScope come from. The short period with a fair bit of parallelism that you can see at the beginning of the graph is the time during which the negative subtree of the first filter is executing, since that is the main filter with a negative subtree large enough to really contribute to parallelism. There are also a few similar (but much smaller) events later in the trace where reasonably sized negative trees are created.
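If you want to see this imbalance yourself, instrumentation along these lines (a hypothetical sketch, not the code I actually ran) emits named markers and events into the eventlog, where ThreadScope displays them on the timeline; the filter and row counts here are made-up stand-ins for your real split.

```haskell
import Debug.Trace (traceEventIO, traceMarkerIO)

main :: IO ()
main = do
  -- Marker shows up as a labelled vertical line in ThreadScope
  -- (requires building with -eventlog and running with +RTS -l).
  traceMarkerIO "constructTree: splitting on filter"
  let rows   = [1 .. 1000 :: Int]
      passed = filter even rows   -- stand-in for passedTData
      failed = filter odd  rows   -- stand-in for failedTData
  -- Events record how evenly the filter partitioned the data.
  traceEventIO ("positive side: " ++ show (length passed) ++ " rows")
  traceEventIO ("negative side: " ++ show (length failed) ++ " rows")
  print (length passed, length failed)
```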

I would expect that if you make the change above and try to find filters that more evenly partition the dataset, you should see a fairly large increase in the parallelizability of this code.

Peter Amidon answered Nov 15 '22