I have developed a proof-of-concept system for sound recognition using MFCC features and hidden Markov models. It gives promising results when I test the system on known sounds. However, when an unknown sound is input, the system still returns the closest match, and the score is not distinct enough to tell that it is an unknown sound. For example:
I have trained 3 hidden Markov models: one for speech, one for water coming out of a tap and one for knocking on a desk. Then I test them on unseen data and get the following results:
input: speech
HMM\knocking: -1213.8911146444477
HMM\speech: -617.8735676792728
HMM\watertap: -1504.4735097322673
So the highest score is speech, which is correct.
input: watertap
HMM\knocking: -3715.7246152783955
HMM\speech: -4302.67960438553
HMM\watertap: -1965.6149147201534
So the highest score is watertap, which is correct.
input: knocking
HMM\filler: -806.7248912250212
HMM\knocking: -756.4428782636676
HMM\speech: -1201.686687761133
HMM\watertap: -3025.181144273698
So the highest score is knocking, which is correct.
input: unknown
HMM\knocking: -4369.1702184688975
HMM\speech: -5090.37122832872
HMM\watertap: -7717.501505674925
Here the input is an unknown sound but it still returns the closest match as there is no system for thresholding/garbage filtering.
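To make the setup concrete, the pipeline looks roughly like the sketch below. This is a minimal sketch assuming hmmlearn for the HMMs and python_speech_features for the MFCC extraction; the function names are illustrative, not my exact code:

```python
# Minimal sketch of the training/scoring pipeline described above, assuming
# hmmlearn for the HMMs and python_speech_features for the MFCCs.
# Function and variable names are illustrative.
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc
from hmmlearn import hmm

def extract_mfcc(wav_path):
    """Read a (mono) wav file and return its MFCC matrix of shape (frames, 13)."""
    rate, signal = wavfile.read(wav_path)
    return mfcc(signal, samplerate=rate)

def train_class_hmm(mfcc_sequences, n_states=6):
    """Train one GaussianHMM on a list of MFCC matrices belonging to one class."""
    X = np.concatenate(mfcc_sequences)
    lengths = [len(seq) for seq in mfcc_sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

def classify(wav_path, models):
    """Score the input against every class HMM and return the best match.

    models: dict mapping class name -> trained HMM. Note that this always
    returns the closest match, even for unknown sounds (no rejection yet)."""
    features = extract_mfcc(wav_path)
    scores = {name: m.score(features) for name, m in models.items()}
    return max(scores, key=scores.get), scores
```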
I know that in keyword spotting an OOV (out-of-vocabulary) sound can be filtered out using a garbage or filler model, but that model is described as being trained on a finite set of unknown words. This can't be applied to my system, since I don't know all the sounds it may record.
How is a similar problem solved in speech recognition systems? And how can I solve my problem to avoid false positives?
The hidden Markov model is a probabilistic model used to describe the statistical behaviour of a random process. It assumes that an observed event does not correspond directly to the underlying state, but is generated from it through a set of probability distributions.
An HMM provides solutions to three problems, evaluation, decoding and learning, which are used to find the most likely classification.
Disadvantages of HMM: an HMM depends only on each state and its corresponding observation. Sequence labeling, however, also depends on aspects such as the observed sequence length, word context and others.
A hidden Markov model is an extension of a Markov model in which both the transitions between states and the observations emitted from each state are probabilistic. In this case, the probability of emitting observation o_t at state s is written b_s(o_t).
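As a sketch of the evaluation problem mentioned above, the forward algorithm computes log P(O | model) from the initial distribution, the transition matrix and the emission probabilities b_s(o_t). The toy model below uses made-up numbers purely for illustration:

```python
# Forward algorithm in the log domain for a discrete-observation HMM.
# Computes log P(O | model); the numbers below are made up for illustration.
import numpy as np
from scipy.special import logsumexp

def forward_log_likelihood(log_pi, log_A, log_B, obs):
    """log_pi: (S,) initial state log-probs
    log_A:  (S, S) transition log-probs, log_A[i, j] = log P(state j | state i)
    log_B:  (S, V) emission log-probs, log_B[s, o] = log b_s(o)
    obs:    sequence of observation symbol indices"""
    alpha = log_pi + log_B[:, obs[0]]                 # alpha_1(s) = pi_s * b_s(o_1)
    for o in obs[1:]:
        # alpha_t(s) = b_s(o_t) * sum_s' alpha_{t-1}(s') * A[s', s]
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return logsumexp(alpha)                           # log P(O | model)

# Toy 2-state, 3-symbol example
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3],
                [0.4, 0.6]])
log_B = np.log([[0.5, 0.4, 0.1],
                [0.1, 0.3, 0.6]])
print(forward_log_likelihood(log_pi, log_A, log_B, [0, 1, 2, 2]))
```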
To reject other words you need a filler model.
This is a statistical hypothesis test. You have two hypotheses (the word is known vs. the word is unknown). To make a decision you need to estimate the probability of each hypothesis.
The filler model is trained from the speech you have, just in a different way; for example, it might be a single Gaussian over any speech sound. You compare the score from the generic filler model with the score from the word HMM and make a decision. For more in-depth information and advanced algorithms you can check any paper on keyword spotting. This thesis has a good review:
A. J. Kishan Thambiratnam, Acoustic Keyword Spotting in Speech with Applications to Data Mining.
http://eprints.qut.edu.au/37254/1/Albert_Thambiratnam_Thesis.pdf
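As a rough sketch of that decision (my own illustration, not taken from the thesis): train a single Gaussian on all training frames as the filler, then accept a keyword only if the per-frame log-likelihood ratio against the filler exceeds a threshold tuned on held-out data. The library calls (an hmmlearn-style score(), scipy's multivariate_normal) and the threshold value are assumptions:

```python
# Sketch of a word-vs-filler decision: a single-Gaussian filler over all
# training MFCC frames and a per-frame log-likelihood ratio test.
# Names and the default threshold are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def train_filler_gaussian(all_training_mfccs):
    """Fit one Gaussian to every MFCC frame from every training sound."""
    frames = np.concatenate(all_training_mfccs)
    return multivariate_normal(mean=frames.mean(axis=0), cov=np.cov(frames.T))

def accept_keyword(features, keyword_hmm, filler, threshold=0.0):
    """Accept the keyword only if it beats the filler by `threshold` per frame.

    keyword_hmm: a trained HMM exposing score() (e.g. hmmlearn), returning
    the log-likelihood of the whole feature sequence."""
    keyword_score = keyword_hmm.score(features)        # log P(O | keyword HMM)
    filler_score = filler.logpdf(features).sum()       # frames treated as independent
    ratio = (keyword_score - filler_score) / len(features)
    return ratio > threshold, ratio
```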
So what I have done is create my own simplified version of a filler model. Each HMM, representing the watertap, knocking and speech sound respectively, is a separate 6-state HMM trained on training sets of 30, 50 and 90 sounds of various lengths, from 0.3 s to 10 s. Then I created a filler model, which is a 1-state HMM trained on all of the training-set sounds for knocking, watertap and speech. If an HMM's score for a given sound is greater than the filler's score, the sound is recognized; otherwise it is treated as an unknown sound (a sketch of this setup is shown below, after the test results). I don't have much data, but I performed the following test for false-positive rejection and true-positive rejection on unseen sounds.
True-positive rejection:
knocking: 1/11 = 90% accuracy
watertap: 1/9 = 89% accuracy
speech: 0/14 = 100% accuracy
False-positive rejection:
tested 7 unknown sounds
6/7 = 86% accuracy
So from this quick test I can conclude that this approach gives reasonable results, although I have a feeling it may not be enough.
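For reference, here is a minimal sketch of this filler-model setup, assuming hmmlearn; variable names are illustrative and the training-data loading is not shown:

```python
# Sketch of the simplified filler model: a 1-state GaussianHMM trained on the
# pooled knocking/watertap/speech training MFCCs. A sound is accepted only if
# its best class HMM score beats the filler score, otherwise it is "unknown".
import numpy as np
from hmmlearn import hmm

def train_filler_hmm(all_mfcc_sequences):
    """1-state HMM trained on every training sequence from every known class."""
    X = np.concatenate(all_mfcc_sequences)
    lengths = [len(seq) for seq in all_mfcc_sequences]
    filler = hmm.GaussianHMM(n_components=1, covariance_type="diag", n_iter=10)
    filler.fit(X, lengths)
    return filler

def classify_with_rejection(features, class_models, filler):
    """Return the best class if it beats the filler, otherwise 'unknown'."""
    scores = {name: m.score(features) for name, m in class_models.items()}
    best = max(scores, key=scores.get)
    if scores[best] > filler.score(features):
        return best
    return "unknown"
```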