I am attempting to create a program that can find human figures in gameplay video from Call of Duty. I have compiled a set of ~2200 separate images from this video that either contain a human figure or do not. I have then attempted to train a neural network to tell the difference between the two sets of images.
Then, I divide each video frame into a couple hundred gridded rectangles and check each one with my ANN. The rectangles overlap in an attempt to capture figures that fall between grid rects, but this doesn't seem to work well. So I have a few questions:
Are neural networks the way to go? I have read that they are very fast compared to other machine learning algorithms, and eventually I plan to use this with real time video and speed is very important.
What is the best way to search for the figures in the image frame to test on the ANN? I feel like the way I do it isn't very good. It's definitely not very fast or accurate: it takes about a second per 960 x 540 frame and has poor accuracy.
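For reference, the overlapping grid scan described above can be sketched roughly as follows. This is only a minimal illustration, not my actual code: the function name `sliding_windows`, the 100 x 200 window size, and the strides are assumptions chosen for the example.

```python
from typing import Iterator, Tuple


def sliding_windows(
    frame_w: int,
    frame_h: int,
    win_w: int,
    win_h: int,
    stride_x: int,
    stride_y: int,
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (x, y, w, h) rectangles that tile the frame.

    Rectangles overlap whenever the stride is smaller than the window size.
    """
    for y in range(0, frame_h - win_h + 1, stride_y):
        for x in range(0, frame_w - win_w + 1, stride_x):
            yield (x, y, win_w, win_h)


# Hypothetical settings: 100 x 200 windows with 50% overlap on a 960 x 540 frame.
windows = list(sliding_windows(960, 540, 100, 200, 50, 100))
print(len(windows), "rectangles to classify per frame")
```

Each rectangle is one classifier evaluation, so the per-frame cost grows with the number of windows; halving the stride in both directions roughly quadruples it.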
Another problem I have had is finding the best way to build the feature vector to use as the input to the ANN. Currently, I just scale all input images down to 25 x 50 pixels and create a feature vector containing the intensity of every pixel. This is a very large vector (1250 floats). What are better ways to build a feature vector?
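A rough sketch of the current feature extraction, assuming a grayscale patch stored as a NumPy array. The helper name `pixel_feature_vector` and the nearest-neighbour downscaling are assumptions made to keep the example self-contained; an image library's resize function would normally do the scaling.

```python
import numpy as np


def pixel_feature_vector(patch: np.ndarray, out_w: int = 25, out_h: int = 50) -> np.ndarray:
    """Downscale a grayscale patch (H x W, values 0..255) and flatten it.

    Uses nearest-neighbour sampling, then scales intensities to [0, 1].
    """
    h, w = patch.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    small = patch[np.ix_(rows, cols)]
    return (small.astype(np.float32) / 255.0).ravel()


# Hypothetical input: a random 100 x 200 (width x height) patch standing in for a window.
patch = np.random.default_rng(0).integers(0, 256, size=(200, 100), dtype=np.uint8)
features = pixel_feature_vector(patch)
print(features.shape)
```

The resulting vector has one entry per pixel of the 25 x 50 thumbnail, which is where the 1250 floats come from.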
For a more detailed explanation of what I do, see here: CodAI: Computer Vision
EDIT: I would like a little more detail. What is the best way to calculate features? I need to be able to recognize a human figure in many different positions. Do I need to create separate classifiers to distinguish between upright, crouched, and prone figures?
Footnotes:
1 For the nitpickers: Without a highly complex classifier.
2 You can also employ a cascade of boosted classifiers to gain speed without giving away too much in detection rate.
This problem is too hard for a normal ANN.
ANNs aren't really well suited to images with lots of spatial transformations (e.g. human figures in different positions). They effectively need to learn each possible position independently, since they can't generalise well over translations, rotations, scaling and so on. Even if you managed to make it work, you'd probably need billions of training images and years of training time.
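A toy illustration of the translation problem, assuming the raw pixel-intensity input described in the question. The bar-shaped "figure" and the 6-pixel shift are made up for the example; the point is only that a small shift can leave the two input vectors with nothing in common.

```python
import numpy as np

# A 25 x 50 (width x height) frame with a bright vertical bar as a stand-in figure.
frame = np.zeros((50, 25), dtype=np.float32)
frame[10:40, 8:14] = 1.0

# The same figure moved 6 pixels to the right (no wrap-around at this position).
shifted = np.roll(frame, 6, axis=1)

a = frame.ravel()
b = shifted.ravel()

# Cosine similarity between the two raw pixel vectors.
cosine = float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))
print("cosine similarity after a 6-pixel shift:", cosine)
```

To a network fed raw pixels, these are two unrelated inputs, so it has to see training examples of the figure at each position to respond to both.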
Your best bet is probably to go with either: