
HMM algorithm for gesture recognition

I want to develop an app for gesture recognition using Kinect and hidden Markov models. I watched a tutorial here: HMM lecture

But I don't know where to start. What is the state set, and how should the data be normalized so that HMM learning is feasible? I know (more or less) how it would be done for signals and for simple "left-to-right" cases, but 3D space makes me a little confused. Could anyone describe how to begin?

Could anyone describe the steps in detail? In particular, I need to know how to build the model and what the steps of the HMM algorithm should be.

Nickon asked Jan 28 '13



2 Answers

One way to apply HMMs to gesture recognition is to use an architecture similar to the one commonly used for speech recognition.

The HMM would not be over space but over time, and each video frame (or set of extracted features from the frame) would be an emission from an HMM state.

Unfortunately, HMM-based speech recognition is a rather large area. Many books and theses have been written describing different architectures. I recommend starting with Jelinek's "Statistical Methods for Speech Recognition" (http://books.google.ca/books?id=1C9dzcJTWowC&pg=PR5#v=onepage&q&f=false) then following the references from there. Another resource is the CMU sphinx webpage (http://cmusphinx.sourceforge.net).

Another thing to keep in mind is that HMM-based systems are probably less accurate than discriminative approaches like conditional random fields or max-margin recognizers (e.g. SVM-struct).

For an HMM-based recognizer the overall training process is usually something like the following:

1) Perform some sort of signal processing on the raw data

  • For speech this would involve converting raw audio into mel-cepstrum format, while for gestures, this might involve extracting image features (SIFT, GIST, etc.)

2) Apply vector quantization (VQ) to the processed data (other clustering or discretization techniques can also be used)

  • Each cluster centroid is usually associated with a basic unit of the task. In speech recognition, for instance, each centroid could be associated with a phoneme. For a gesture recognition task, each VQ centroid could be associated with a pose or hand configuration.

3) Manually construct HMMs whose state transitions capture the sequence of different poses within a gesture.

  • Emission distributions of these HMM states will be centered on the VQ centroids from step 2.

  • In speech recognition these HMMs are built from phoneme dictionaries that give the sequence of phonemes for each word.

4) Construct a single HMM that contains transitions between each individual gesture HMM (or, in the case of speech recognition, each phoneme HMM). Then train the composite HMM with videos of gestures.

  • It is also possible at this point to train each gesture HMM individually before the joint training step. This additional training step may result in better recognizers.
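The quantization and topology pieces of steps 2 and 3 can be sketched in a few lines of Python. This is a toy illustration, not the answer's exact method: the k-means routine, the two-cluster synthetic data, and the `p_stay` self-loop probability are all assumptions made for the sketch.

```python
import numpy as np

def kmeans_vq(frames, k, iters=20):
    """Toy k-means vector quantization: returns the codebook of
    centroids and the codebook index assigned to each frame."""
    # deterministic init: spread the initial centroids across frame indices
    idx = np.linspace(0, len(frames) - 1, k).astype(int)
    centroids = frames[idx].astype(float)
    for _ in range(iters):
        # assign each frame to its nearest centroid
        d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned frames
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = frames[labels == j].mean(axis=0)
    return centroids, labels

def left_to_right_transitions(n_states, p_stay=0.6):
    """Left-to-right HMM topology: each state either stays put or
    advances to the next state; the last state absorbs."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay
        A[i, i + 1] = 1.0 - p_stay
    A[-1, -1] = 1.0
    return A

# toy "pose" features: two well-separated clusters in 2-D
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                    rng.normal(3.0, 0.1, (50, 2))])
centroids, labels = kmeans_vq(frames, k=2)   # step 2: VQ codebook
A = left_to_right_transitions(3)             # step 3: gesture HMM topology
```

In a real system each row of `frames` would be the feature vector extracted from one video frame in step 1, and one left-to-right HMM would be built per gesture.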

For the recognition process, apply the signal processing step, find the nearest VQ entry for each frame, then find a high scoring path through the HMM (either the Viterbi path, or one of a set of paths from an A* search) given the quantized vectors. This path gives the predicted gestures in the video.
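The path search in that recognition step can be sketched as a standard Viterbi decode over log-probabilities. This is a minimal discrete-emission version; the 2-state transition, emission, and initial distributions below are made-up toy values, not anything from the original answer.

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely state path for a discrete-emission HMM.
    obs:    sequence of VQ codebook indices
    log_A:  log_A[i, j] = log P(next state j | state i)
    log_B:  log_B[i, o] = log P(observation o | state i)
    log_pi: log_pi[i]   = log P(initial state i)"""
    n_states, T = log_A.shape[0], len(obs)
    delta = np.full((T, n_states), -np.inf)    # best log-score ending in each state
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):              # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy 2-state model: state 0 mostly emits symbol 0, state 1 mostly emits 1
log_A = np.log([[0.8, 0.2], [0.2, 0.8]])
log_B = np.log([[0.9, 0.1], [0.1, 0.9]])
log_pi = np.log([0.5, 0.5])
path = viterbi([0, 0, 1, 1], log_A, log_B, log_pi)  # → [0, 0, 1, 1]
```

In the composite HMM described above, the decoded state sequence identifies which gesture sub-model each stretch of frames passed through.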

user1149913 answered Oct 04 '22

I implemented the 2D version of this for the Coursera PGM class, which has Kinect gestures as the final unit.

https://www.coursera.org/course/pgm

Basically, the idea is that an HMM alone can't decide individual poses very well. In our unit, I used a variation of k-means to segment the poses into probabilistic categories. The HMM was then used to decide which sequences of poses were actually viable as gestures. Any clustering algorithm run on a set of poses is a good candidate, even if you don't know what kind of pose each cluster represents.

From there you can create a model which trains on the aggregate probabilities of each possible pose for each point of kinect data.
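A minimal sketch of that soft-assignment step, assuming frames have already been reduced to pose feature vectors: here each frame gets a probability over pose clusters via a softmax over negative centroid distances (an illustrative choice, not necessarily what the course's model used).

```python
import numpy as np

def pose_probabilities(frames, centroids, temperature=1.0):
    """Soft-assign each frame to pose clusters: a softmax over
    negative distances to the cluster centroids."""
    d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
    logits = -d / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)       # shape (n_frames, n_poses)

# two hypothetical pose centroids, and one frame near each of them
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])
frames = np.array([[0.1, 0.0], [2.9, 3.1]])
probs = pose_probabilities(frames, centroids)
```

These per-frame pose distributions are what the HMM then consumes when scoring whether a sequence of poses forms a plausible gesture.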

I know this is a bit of a sparse answer. That class gives an excellent overview of the state of the art, but the problem in general is a bit too difficult to be condensed into an easy answer. (I'd recommend taking the class in April if you're interested in this field.)

argentage answered Oct 04 '22