 

Exact Hidden Markov Model training algorithm

In most cases, the Baum-Welch algorithm is used to train a Hidden Markov model.

In many papers, however, it is argued that the BW algorithm only optimizes until it gets stuck in a local optimum.

Does there exist an exact algorithm that actually succeeds in finding the global optimum (other than enumerating nearly all possible models and evaluating them)?

Of course, for most applications BW will work fine. We are, however, interested in finding lower bounds on the amount of information lost when reducing the number of states, so we always need to generate the best model possible.

We are thus looking for an exact algorithm, even if it is NP-hard, that enumerates only over a (potentially exponential) number of extreme points, rather than over a discretized grid of floating-point values for each probability in the model.

asked May 19 '14 by Willem Van Onsem

2 Answers

A quick search finds, at http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec06/node6.html: "In this case, the problem of finding the optimal set of parameters $\Theta^{\ast}$ is known to be NP-complete. The Baum-Welch algorithm [2], which is a special case of the EM technique (Expectation and Maximization), can be used for heuristically finding a solution to the problem." Therefore an EM variant guaranteed to find a global optimum in polynomial time would prove P = NP; no such variant is known, and one probably does not exist.

This problem is almost certainly not convex, because there will in fact be multiple globally optimal solutions with the same score: given any proposed solution, which typically gives a probability distribution over observations given the underlying state, you can, for instance, rename hidden state 0 as hidden state 1 and vice versa, adjusting the probability distributions so that the observed behaviour generated by the two different solutions is identical. Thus if there are N hidden states, there are at least N! local optima produced by permuting the hidden states amongst themselves.
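Here is a minimal numpy sketch of this permutation symmetry (the forward-algorithm helper and all the numbers are my own toy example, just for illustration): permuting the hidden states leaves the likelihood of an observation sequence unchanged, so the permuted parameter sets are distinct optima with identical scores.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(obs | pi, A, B) via the forward algorithm.
    pi: (N,) initial distribution, A: (N, N) transitions, B: (N, M) emissions."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# A toy 2-state, 2-symbol model (numbers are arbitrary).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])
B  = np.array([[0.9, 0.1],
               [0.3, 0.7]])
obs = [0, 1, 1, 0, 1]

# Swap the two hidden states: permute pi, the rows/columns of A, and the rows of B.
perm = [1, 0]
pi_p = pi[perm]
A_p  = A[np.ix_(perm, perm)]
B_p  = B[perm, :]

print(forward_likelihood(pi, A, B, obs))        # original model
print(forward_likelihood(pi_p, A_p, B_p, obs))  # permuted model: same likelihood
```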

For another application of EM, https://www.math.ias.edu/csdm/files/11-12/amoitra_disentangling_gaussians.pdf provides an algorithm for finding a globally optimal Gaussian mixture model. It observes that the EM algorithm is often used for this problem, but points out that it is not guaranteed to find the global optimum, and does not reference as related work any version of EM which might (it also notes that the EM algorithm is slow).

If you are trying to do some sort of likelihood ratio test between e.g. a 4-state model and a 5-state model, it would clearly be embarrassing if, due to local optima, the fitted 5-state model had a lower likelihood than the 4-state model. One way to avoid this, or to recover from it, is to start the 5-state EM from a starting point very close to the best 4-state model found. For instance, you could create a 5th state with probability epsilon and with an output distribution reflecting an average of the 4 existing output distributions, keeping the 4-state distributions as the other 4 distributions in the new 5-state model, and multiplying in a factor of (1 - epsilon) somewhere so that everything still adds up to one.
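A rough sketch of one way to build such a 5-state starting point from a fitted 4-state model (the function name, the choice of where the (1 - epsilon) factors go, and the example numbers are my own, just for illustration):

```python
import numpy as np

def grow_hmm(pi4, A4, B4, eps=1e-3):
    """Expand a fitted 4-state HMM (pi4, A4, B4) into a 5-state starting point."""
    N, M = B4.shape                                        # N = 4 states, M symbols
    pi5 = np.append(pi4 * (1 - eps), eps)                  # small mass on the new state
    A5 = np.zeros((N + 1, N + 1))
    A5[:N, :N] = A4 * (1 - eps)                            # old transitions, rescaled
    A5[:N, N] = eps                                        # small chance of entering state 5
    A5[N, :] = np.append(np.full(N, (1 - eps) / N), eps)   # leave state 5 roughly uniformly
    B5 = np.vstack([B4, B4.mean(axis=0)])                  # new emission row = average of old ones
    return pi5, A5, B5

# Example with an arbitrary (made-up) 4-state, 3-symbol model:
rng = np.random.default_rng(0)
A4 = rng.dirichlet(np.ones(4), size=4)
B4 = rng.dirichlet(np.ones(3), size=4)
pi4 = rng.dirichlet(np.ones(4))

pi5, A5, B5 = grow_hmm(pi4, A4, B4)
assert np.allclose(A5.sum(axis=1), 1) and np.allclose(B5.sum(axis=1), 1)
# (pi5, A5, B5) can now serve as the initial point for 5-state Baum-Welch.
```

Exactly where the (1 - epsilon) rescaling is applied (initial distribution, transition rows, or both) is a design choice; the only requirement is that every distribution still sums to one.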

answered Oct 21 '22 by mcdowella


I think if you really want this, you can, given a local optimum, define a domain of convergence. Under some reasonably weak conditions, you can then quickly show that either the whole field is in the domain of convergence, or that there is a second local minimum.

E.g., suppose in an example I have two independent variables (x, y) and one dependent variable z, and suppose that, given a local minimum z_1 attained at (x_1, y_1) and a pair of start points p_1 = (x_2, y_1) and p_2 = (x_1, y_2) which converge to it, then I might be able to prove that the whole triangle (x_1, y_1), p_1, p_2 is in the domain of convergence.

Of course, this is not an approach which works generally, but you can solve a subclass of problems efficiently. E.g., some problems have no domain of convergence in a sense: it is possible to have a problem where a point converges to a different solution than all the points in its neighbourhood. But lots of problems have some reasonable smoothness in their convergence to a solution, so then you can do OK.
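A rough sketch of mapping out such domains of convergence empirically on a toy 2D function (the function, grid, and optimizer choice are made up, just for illustration): run a local optimizer from a grid of start points and group the start points by which minimum they reach.

```python
import numpy as np
from scipy.optimize import minimize

def f(p):
    x, y = p
    # A made-up 2D function with two local minima, near (-1, 0) and (1, 0).
    return (x**2 - 1)**2 + y**2

minima_found = {}
for x0 in np.linspace(-2, 2, 8):
    for y0 in np.linspace(-2, 2, 8):
        res = minimize(f, x0=[x0, y0], method="BFGS")
        key = tuple(np.round(res.x, 3))      # group start points by the minimum reached
        minima_found.setdefault(key, []).append((x0, y0))

for minimum, starts in minima_found.items():
    print(f"minimum near {minimum}: reached from {len(starts)} start points")
```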

answered Oct 21 '22 by phil_20686