I constructed a couple of Hidden Markov Models using the Baum-Welch algorithm for an increasing number of states. I noticed that after 8 states, the validation score goes down for more than 8 states. So I wondered whether it's possible that the accuracy of an Hidden Markov Model can decrease with an increasing number of states, due to some kind of overfitting?
Thanks in advance!
HMM provides solution of three problems : evaluation, decoding and learning to find most likelihood classification.
Under the Markov assumption and the independence assumption, an HMM only needs two set of parameters to model P(st|st−1) and P(ot|st). We call these, respectively, the transition probabilities and the emission probabilities.
2. How does the state of the process is described in HMM? Explanation: An HMM is a temporal probabilistic model in which the state of the process is described by a single discrete random variable.
For the sake of clarity, I propose here a very simplified illustration of the phenomenon.
Say you train your HMM with the data sequence (A-B-A-B). Say you use a 2-state HMM. Naturally, state 1 will optimize itself to represent A and state 2 will represent B (or the opposite). Then, you have a new sequence (A-B-A-B). You want to know the likelihood this sequence has with respect to your HMM. A Viterbi algorithm will find that the most probable state sequence is (1-2-1-2), and the Baum-Welch algorithm will give this sequence a high likelihood as the sequence of states and the "values" of the new sequence (if working with continuous data) clearly match your training sequence.
Say now that you train a 3-state HMM with the same training sequence (A-B-A-B). The initial clustering of your data will most probably either assign the 2 first states of the HMM for the representation of symbol A, and the last one to symbol B (or the opposite once again).
So now, the query sequence (A-B-A-B) can be represented as the state sequence (1-3-1-3) or (2-3-2-3) or (1-3-2-3) or (2-3-1-3) ! This means that for this 3-state HMM, two identical sequences (A-B-A-B) can have low similarity for the HMM. That is the reason why for any HMM and any data set, beyond a certain number of states, performance will decrease.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, the Minimum Message Length Criterion, or if you just want to get a blur idea, a k-means clustering combined with the percentage of variance explained. The 3 first criteria are interesting because they include a penalty term that goes with the number of parameters of the model.
Hope it helps! :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With