How does Monoid assist in parallel training?

The Readme of HLearn states that the Monoid typeclass is used for parallel batch training. I've seen trainMonoid mentioned in several files, but I'm having difficulty dissecting this huge codebase. Could someone explain in beginner-friendly terms how it works? I guess it's somehow related to the associativity property.

asked Oct 09 '16 by dimid

1 Answer

It's explained in this article, which is linked from the page you linked in the question. Since you want a beginner-friendly description I'll give you a very high-level overview of what I understood after reading the article. Consider this a rough sketch of the idea; to understand everything exactly you'll have to study the articles themselves.

The basic idea is to use algebraic properties to avoid re-doing the same work over and over. They do it by using the associativity of the monoidal operation and homomorphisms.

Given two sets A and B with binary operations + and *, a homomorphism is a function f: A -> B such that f(x + y) = f(x) * f(y), i.e. a function that preserves the structure between the two sets. In the case of that article, the function f is basically the training function that maps sets of inputs to trained models.

So the idea is that you can divide the input data into two portions x and y, and instead of having to compute the model of the whole thing as T(x + y), you can train on just x and y and then combine the results: T(x) * T(y).
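
To make this concrete, here is a minimal sketch in Haskell (the types and names are invented for illustration, not HLearn's actual API): the "model" learns the mean of its inputs, <> plays the role of *, and list concatenation plays the role of +.

    -- Minimal illustration (invented, not HLearn's API): a trained "model"
    -- for the mean of a data set, kept as a running count and sum so that
    -- two models can be merged without revisiting the raw data.
    data MeanModel = MeanModel { count :: !Int, total :: !Double }
      deriving Show

    -- (<>) plays the role of * : it merges two trained models.
    instance Semigroup MeanModel where
      MeanModel n1 s1 <> MeanModel n2 s2 = MeanModel (n1 + n2) (s1 + s2)

    instance Monoid MeanModel where
      mempty = MeanModel 0 0

    -- train is the homomorphism: train (xs ++ ys) == train xs <> train ys
    train :: [Double] -> MeanModel
    train xs = MeanModel (length xs) (sum xs)

    mean :: MeanModel -> Double
    mean (MeanModel n s) = s / fromIntegral n

For example, train [1,2] <> train [3,4] gives the same model as train [1,2,3,4].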

Now this by itself doesn't really help that much, but in training you often repeat work. For example, in cross validation you sample, k times, the data into a set of inputs for the trainer and a set of data used to test the trainer. But this means that across these k iterations you are executing T over the same portions of the input multiple times.

Here monoids come into play: you can first split the domain into subsets and compute T over those subsets; then, to compute the result of cross validation, you just put together the results from the corresponding subsets.

To give an idea: if the data is {1,2,3,4} and k = 3, the naive approach would do:

  • T({1,2}) plus test on {3, 4}
  • T({1,3}) plus test on {2, 4}
  • T({1,4}) plus test on {2, 3}

Here you can see that we trained over {1} three times. Using the homomorphism we can instead compute T({1}) once and then combine the result with the other partial results to obtain each final trained model.

The correctness of the final result is assured by the associativity of the operations and the homomorphism.
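
Sticking with the MeanModel sketch from above (again, invented names, not HLearn's API), here is what that sharing looks like for k-fold cross validation over disjoint subsets: each subset is trained exactly once, and each fold's model is assembled by merging the other subsets' models.

    -- Each subset is trained exactly once (the single map below); fold i's
    -- model is the merge of every subset's model except subset i.
    crossValModels :: [[Double]] -> [MeanModel]
    crossValModels subsets =
      [ mconcat [ m | (j, m) <- zip [0 :: Int ..] models, j /= i ]
      | i <- [0 .. length subsets - 1] ]
      where
        models = map train subsets  -- the only calls to train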

The same idea applies to parallelization: you divide the inputs into k groups, perform the training in parallel, and then combine the results: T(x_1 + x_2 + ... + x_k) = T(x_1) * T(x_2) * ... * T(x_k), where the T(x_i) calls are performed completely in parallel and only at the end do you combine the results.
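
A sketch of that parallel scheme, using parMap from the parallel package on top of the earlier MeanModel (HLearn has its own parallel combinators; this just shows the shape of the idea):

    import Control.Parallel.Strategies (parMap, rseq)

    -- Split the input into roughly k chunks, train every chunk in
    -- parallel, then merge. By associativity, trainPar k xs == train xs.
    trainPar :: Int -> [Double] -> MeanModel
    trainPar k xs = mconcat (parMap rseq train (chunksOf n xs))
      where
        n = max 1 (length xs `div` k)
        chunksOf m ys = case splitAt m ys of
          (c, [])   -> [c]
          (c, rest) -> c : chunksOf m rest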

Regarding online training algorithms, the idea is that given a "batch" training algorithm T you can turn it into an online one by defining:

T_O(m, d) = m * T(d)

where m is an already trained model (in general, the model trained on all the data seen so far) and d is the new data point you add for training.

Again the exactness of the result is due to the homomorphism, which tells you that if m = T(x) then m * T(d) = T(x + d), i.e. the online algorithm gives the same result as the batch algorithm on all of those data points.
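
In terms of the earlier sketch (invented names again), the online update is a one-liner:

    -- Fold one new point into an existing model.
    -- trainOnline (train xs) d == train (xs ++ [d]), by the homomorphism.
    trainOnline :: MeanModel -> Double -> MeanModel
    trainOnline m d = m <> train [d]

so folding a stream of points, foldl trainOnline mempty [1,2,3,4], gives the same model as the batch call train [1,2,3,4].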


The more interesting (and complex) part of all this is how you can actually see a training task as such a homomorphism. I'll leave that to your personal study.

answered Nov 13 '22 by Bakuriu