There are several helpful topics here on how to find similar pictures.
What I want to do is compute a fingerprint of a picture and find that same picture in different photos taken by a digital camera. The SURF algorithm seems to be the best way to stay independent of scaling, rotation, and other distortions.
I'm using OpenCV with the SURF algorithm to extract features on the sample image. Now I'm wondering how to convert all this feature data (position, laplacian, size, orientation, hessian) into a fingerprint or hash.
This fingerprint will be stored in a database, and a search query must be able to compare it against the fingerprint of a photo with almost the same features.
Update:
It seems that there is no way to convert all the descriptor vectors into a simple hash. So what would be the best way to store the image descriptors into the database for fast querying?
Would Vocabulary Trees be an option?
I would be very thankful for any help.
The feature data you mention (position, laplacian, size, orientation, hessian) is insufficient for your purpose (these are actually the least relevant parts of the descriptor if you want to do matching). The data you want to look at are the "descriptors" (the 4th argument):
void cvExtractSURF(const CvArr* image, const CvArr* mask, CvSeq** keypoints, CvSeq** descriptors, CvMemStorage* storage, CvSURFParams params)
These are vectors of 128 or 64 floats (depending on params) that contain the "fingerprint" of each specific feature (each image will yield a variable number of such vectors). If you get the latest version of OpenCV, it ships a sample named find_obj.cpp that shows how they are used for matching.
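In rough outline, that sample matches descriptors by brute-force nearest-neighbour search with a ratio test: a query descriptor is accepted only if its closest match is clearly better than the runner-up. Here is a minimal sketch of that idea in plain Python, assuming descriptors are just lists of floats (the function name is illustrative, not part of the OpenCV API):

```python
import math

def match_descriptors(query, train, ratio=0.6):
    """Brute-force nearest-neighbour matching with a ratio test.

    query, train: lists of descriptor vectors (lists of floats).
    Returns (query_index, train_index) pairs that pass the test.
    """
    matches = []
    for qi, q in enumerate(query):
        # Euclidean distance from q to every training descriptor.
        dists = sorted((math.dist(q, t), ti) for ti, t in enumerate(train))
        if len(dists) >= 2:
            best, second = dists[0], dists[1]
            # Accept only if the best match is clearly better than the runner-up.
            if best[0] < ratio * second[0]:
                matches.append((qi, best[1]))
    return matches
```

Brute force is O(N*M) and fine for a few hundred descriptors per image; for large databases you would switch to an approximate nearest-neighbour index.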
Update:
You might find this discussion helpful.
A trivial way to compute a hash would be the following. Get all the descriptors from the image (say, N of them). Each descriptor is a vector of 128 numbers (which you can quantize to integers between 0 and 255), so you have a set of N*128 integers. Write them one after another into a string and use that as the hash value. If you want the hash values to be small, there are standard ways to compute hash functions of strings, so convert the descriptors to a string and then take the hash of that string.
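A minimal sketch of that scheme in Python (hashlib.sha256 stands in for "a hash function of strings", and the assumption that descriptor components lie roughly in [-1, 1] is what drives the byte quantization):

```python
import hashlib

def descriptor_hash(descriptors):
    """Hash a set of descriptors into a single digest.

    descriptors: list of vectors of floats, assumed roughly in [-1, 1].
    Quantizes each component to a byte, concatenates everything,
    and hashes the result. Only identical descriptor sets *in the
    same order* produce the same digest.
    """
    buf = bytearray()
    for vec in descriptors:
        for x in vec:
            # Map [-1, 1] -> [0, 255], clamping for safety.
            q = int(round((x + 1.0) * 127.5))
            buf.append(max(0, min(255, q)))
    return hashlib.sha256(bytes(buf)).hexdigest()
```

Note that merely reordering the descriptors changes the digest, which is exactly the weakness discussed next.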
That might work if you want to find exact duplicates. But since you talk about scale, rotation, etc., it seems you want to find merely "similar" images, and in that case a hash is probably not a good way to go. You presumably use some interest-point detector to choose the points at which to compute SURF descriptors. Imagine it returns the same set of points but in a different order: suddenly your hash value is completely different, even though the images and descriptors are the same.
So, if I had to find similar images reliably, I'd use a different approach. For example, I could vector-quantize the SURF descriptors, build histograms of the quantized values, and use histogram intersection for matching. Do you really absolutely have to use hash functions (maybe for efficiency), or do you just want to find similar images by whatever means work?
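As a rough sketch of that vector-quantization idea (the tiny two-word vocabulary here is a placeholder; in practice you would learn the centroids with k-means over descriptors from many images):

```python
import math

def quantize(descriptors, vocabulary):
    """Assign each descriptor to its nearest 'visual word' (centroid index)."""
    return [
        min(range(len(vocabulary)), key=lambda i: math.dist(d, vocabulary[i]))
        for d in descriptors
    ]

def histogram(words, vocab_size):
    """Normalized histogram of visual-word counts."""
    h = [0.0] * vocab_size
    for w in words:
        h[w] += 1.0
    total = sum(h) or 1.0
    return [c / total for c in h]

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Because the histogram ignores descriptor order and position, two photos of the same object score highly even under the reorderings that break the naive hash above.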