I'm trying to determine skeleton joints (or at the very least to be able to track a single palm) using a regular webcam. I've looked all over the web and can't seem to find a way to do so.
Every example I've found is using Kinect. I want to use a single webcam.
There's no need for me to calculate the depth of the joints - I just need to be able to recognize their X, Y position in the frame. Which is why I'm using a webcam, not a Kinect.
So far I've looked at:
I'm looking for a C/C++ library (but at this point would look at any other language), preferably open source (but, again, will consider any license) that can do the following:
Would really appreciate it if someone can help me out with this. I've been stuck on this for a few days now with no clear path to proceed.
2 years later a solution was found: http://dlib.net/imaging.html#shape_predictor
Tracking a hand with a single camera and no depth information is a hard problem and a topic of ongoing research. Here are some interesting and/or highly cited scientific papers on the topic:
Hand tracking literature survey in the 2nd chapter:
Unfortunately, I don't know of any freely available hand-tracking library.
There is a simple way to detect a hand using skin tone; perhaps this could help. You can see the results in this YouTube video. Caveat: the background shouldn't contain skin-colored things such as wood.
Here is the code:
''' Detect human skin tone and draw a boundary around it.
Useful for gesture recognition and motion tracking.
Inspired by: http://stackoverflow.com/a/14756351/1463143
Date: 08 June 2013
'''
# Required modules
import cv2
import numpy

# Constants for finding range of skin color in YCrCb
min_YCrCb = numpy.array([0, 133, 77], numpy.uint8)
max_YCrCb = numpy.array([255, 173, 127], numpy.uint8)

# Create a window to display the camera feed
cv2.namedWindow('Camera Output')

# Get pointer to video frames from primary device
videoFrame = cv2.VideoCapture(0)

# Process the video frames
keyPressed = -1  # -1 indicates no key pressed
while keyPressed < 0:  # any key pressed has a value >= 0
    # Grab video frame, decode it and return next video frame
    readSuccess, sourceImage = videoFrame.read()
    if not readSuccess:
        break  # camera unavailable or stream ended

    # Convert image to YCrCb
    imageYCrCb = cv2.cvtColor(sourceImage, cv2.COLOR_BGR2YCR_CB)

    # Find region with skin tone in YCrCb image
    skinRegion = cv2.inRange(imageYCrCb, min_YCrCb, max_YCrCb)

    # Do contour detection on skin region.
    # OpenCV 3.x returns (image, contours, hierarchy); 2.x and 4.x return
    # (contours, hierarchy). Index -2 picks the contours in both cases.
    contours = cv2.findContours(skinRegion, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]

    # Draw every sufficiently large contour on the source image
    for i, c in enumerate(contours):
        area = cv2.contourArea(c)
        if area > 1000:
            cv2.drawContours(sourceImage, contours, i, (0, 255, 0), 3)

    # Display the source image
    cv2.imshow('Camera Output', sourceImage)

    # Check for user input to close program
    keyPressed = cv2.waitKey(1)  # wait 1 millisecond in each iteration of while loop

# Close window and camera after exiting the while loop
cv2.destroyWindow('Camera Output')
videoFrame.release()
cv2.findContours is quite useful: once you have the contours, you can find the centroid of a "blob" using cv2.moments. Have a look at the OpenCV documentation on shape descriptors.
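As a rough illustration of the centroid idea: cv2.moments returns a dictionary with keys such as m00, m10 and m01, and the centroid is (m10/m00, m01/m00). The sketch below computes the same raw moments with plain numpy on a binary mask instead of on a contour, so it runs without a camera. The helper name `mask_centroid` and the synthetic blob are my own assumptions for the example.

```python
import numpy as np

def mask_centroid(mask):
    """Return the (x, y) centroid of the non-zero pixels of a 2-D mask, or None if it is empty."""
    ys, xs = np.nonzero(mask)      # row (y) and column (x) indices of foreground pixels
    m00 = xs.size                  # zeroth moment: number of foreground pixels
    if m00 == 0:
        return None                # no blob, so no centroid
    m10 = xs.sum()                 # first moment along x
    m01 = ys.sum()                 # first moment along y
    return (m10 / m00, m01 / m00)

# Synthetic "blob": a filled rectangle covering rows 1..3 and columns 2..5
blob = np.zeros((6, 8), dtype=np.uint8)
blob[1:4, 2:6] = 255
centroid = mask_centroid(blob)
print(centroid)
```

In the skin-tone script you would pass something like skinRegion (or a mask of one large contour) and get back the X, Y position of the hand in the frame.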
I haven't yet figured out how to build the skeleton that runs down the middle of the contour, but I was thinking of "eroding" the contour until it is a single line. In image processing this process is called "skeletonization" or "morphological skeleton". Here is some basic info on skeletonization.
Here is a link that implements skeletonization in OpenCV and C++.
Here is a link for skeletonization in OpenCV and Python.
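For what it's worth, the "erode until only a line is left" idea can be sketched like this. It is a minimal morphological-skeleton sketch in plain numpy, not the code from the links above; the helper names (`erode`, `dilate`, `morphological_skeleton`) and the 3x3 cross-shaped neighbourhood are assumptions I made for the example. With OpenCV you would normally call cv2.erode and cv2.dilate instead of the hand-written helpers.

```python
import numpy as np

def _neighbours(img):
    """Return the image and its four axis-aligned shifts; outside the frame counts as background."""
    p = np.pad(img, 1, mode="constant", constant_values=False)
    return p[1:-1, 1:-1], p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:]

def erode(img):
    """Binary erosion with a 3x3 cross: keep a pixel only if it and its 4 neighbours are set."""
    c, u, d, l, r = _neighbours(img)
    return c & u & d & l & r

def dilate(img):
    """Binary dilation with a 3x3 cross: set a pixel if it or any of its 4 neighbours is set."""
    c, u, d, l, r = _neighbours(img)
    return c | u | d | l | r

def morphological_skeleton(mask):
    """Union, over successive erosions, of the pixels that an opening would remove."""
    img = np.asarray(mask).astype(bool)
    skeleton = np.zeros_like(img)
    while img.any():
        eroded = erode(img)
        opened = dilate(eroded)        # opening = erosion followed by dilation
        skeleton |= img & ~opened      # keep what the opening cannot restore
        img = eroded                   # shrink the shape and repeat
    return skeleton

# Synthetic shape: a horizontal bar three pixels thick
bar = np.zeros((7, 11), dtype=bool)
bar[2:5, 1:10] = True
skel = morphological_skeleton(bar)
print(skel.astype(int))
```

The result keeps the centre line of the bar; this kind of skeleton can also leave a few stray pixels at sharp corners, so some cleanup is usually needed before treating it as a set of joints.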
Hope that helps :)
--- EDIT ----
I would highly recommend that you go through these papers by Deva Ramanan (scroll down after visiting the linked page): http://www.ics.uci.edu/~dramanan/
The most common approach can be seen in the following youtube video. http://www.youtube.com/watch?v=xML2S6bvMwI
This method is not very robust, as it tends to fail if the hand is rotated too much (e.g., if the camera is looking at the side of the hand or at a partially bent hand).
If you do not mind using two cameras, you can look into the work of Robert Wang. His current company (3GearSystems) uses this technology, augmented with a Kinect, to provide tracking. His original paper uses two webcams but has much worse tracking.
Wang, Robert, Sylvain Paris, and Jovan Popović. "6d hands: markerless hand-tracking for computer aided design." Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011.
Another option (again, if using "more" than a single webcam is possible) is to use an IR emitter. Your hand reflects IR light quite well, whereas the background does not. By adding a filter to the webcam that blocks visible light (and removing the standard filter that does the opposite), you can build quite an effective hand tracker. The advantage of this method is that segmenting the hand from the background is much simpler. Depending on the distance and the quality of the camera, you may need more IR LEDs in order to reflect sufficient light back into the webcam. The Leap Motion uses this technology to track the fingers and palms (it uses 2 IR cameras and 3 IR LEDs to also get depth information).
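To show why the segmentation gets simpler with IR lighting: if the hand is much brighter than the background, a plain brightness threshold is roughly all that is needed. The sketch below is only an illustration on a synthetic 8-bit grayscale frame; the helper name `segment_bright` and the threshold of 128 are assumptions, and a real setup would need the threshold tuned to the LEDs and camera.

```python
import numpy as np

def segment_bright(gray, threshold=128):
    """Return a boolean mask of the pixels brighter than `threshold` (8-bit grayscale assumed)."""
    return np.asarray(gray) > threshold

# Synthetic IR frame: dim background with a bright, IR-lit "hand" region
frame = np.full((5, 5), 20, dtype=np.uint8)
frame[1:4, 1:3] = 200
hand_mask = segment_bright(frame)
print(hand_mask.astype(int))
```

The resulting mask can then go through the same contour and centroid steps as the skin-tone mask.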
All that being said, I think the Kinect is your best option here. Yes, you don't need the depth, but the depth information does make it a lot easier to detect the hand (by using it for the segmentation).
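A rough sketch of what depth-based segmentation can look like, assuming the hand is the object closest to the sensor: keep only the pixels within a small band behind the nearest valid depth reading. Everything here is an assumption made for illustration, including the helper name `segment_nearest`, the 150 mm band, depth values in millimetres, and 0 meaning "no reading".

```python
import numpy as np

def segment_nearest(depth_mm, band_mm=150):
    """Return a boolean mask of pixels within `band_mm` of the nearest valid depth value."""
    depth_mm = np.asarray(depth_mm)
    valid = depth_mm > 0                           # assumption: 0 marks a missing reading
    if not valid.any():
        return np.zeros(depth_mm.shape, dtype=bool)
    nearest = int(depth_mm[valid].min())           # closest surface to the sensor
    return valid & (depth_mm <= nearest + band_mm)

# Synthetic depth map: a wall at 2 m, a "hand" at 0.6 m, and one invalid pixel
depth = np.full((4, 6), 2000, dtype=np.uint16)
depth[1:3, 2:4] = 600
depth[0, 0] = 0
hand = segment_nearest(depth)
print(hand.astype(int))
```

No color model or lighting assumption is involved, which is what makes the depth route so much easier than the webcam-only one.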