I am using the SURF algorithm in C# (OpenSurf) to get a list of interest points from an image. Each of these interest points contains a vector of descriptors, an x coordinate (int), a y coordinate (int), the scale (float) and the orientation (float).
Now I want to compare the interest points of one image against those of a list of images in a database, each of which also has a list of interest points, to find the most similar image. That is: [Image(I.P.)] COMPARETO [List of Images(I.P.)] => best match. Comparing the images on an individual basis yields unsatisfactory results.
When searching Stack Overflow and other sites, the best solution I have found is to build a FLANN index while keeping track of where the interest points come from. But before implementing this, I have some questions that puzzle me:
1) When matching images based on their SURF interest points, one algorithm I have found does the matching by comparing the spatial distance of the points (x1,y1 -> x2,y2) and picking the image with the lowest total distance. Are the descriptors and orientation never used when comparing interest points?
2) If the descriptors are used, then how do I compare them? I can't figure out how to compare X vectors of 64 values (1 image) with Y vectors of 64 values (several images) using an indexed tree.
I would really appreciate some help. Every place I have searched and every API I have found only supports matching one picture to another, not matching one picture efficiently against a list of pictures.
There are multiple things here.
In order to know that two images are (almost) equal, you have to find the homographic projection of the two such that the projection results in a minimal error between the projected feature locations. Brute-forcing that is possible but not efficient, so one trick is to assume that similar images tend to have their feature locations in roughly the same spots (give or take a bit). For example, when stitching images, the images to stitch are usually taken from only a slightly different angle and/or location; even if not, the distances will likely grow ("proportionally") with the difference in orientation.
This means that you can - as a broad phase - select candidate images by finding the k pairs of points with minimum spatial distance (the k nearest neighbors) between all pairs of images and perform homography only on these points. Only then do you compare the projected pairwise spatial distances and sort the images by that distance; the lowest distance implies the best possible match (given the circumstances).
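As a rough sketch of that spatial broad phase (not the homography step itself), something along these lines pairs every interest point of a query image with its k spatially closest points in a candidate image. The InterestPoint type with X/Y fields and the SpatialKnn name are placeholders standing in for OpenSurf's own point type; treat it as an illustration of the idea, not a finished matcher.

    using System.Collections.Generic;
    using System.Linq;

    public class InterestPoint
    {
        public float X;
        public float Y;
    }

    public static class BroadPhase
    {
        // For each query point, return the indices of its k spatially closest
        // points in the candidate image, measured in image coordinates.
        public static List<int[]> SpatialKnn(
            IList<InterestPoint> query, IList<InterestPoint> candidate, int k)
        {
            var result = new List<int[]>();
            foreach (var q in query)
            {
                int[] nearest = candidate
                    .Select((p, index) => new
                    {
                        Index = index,
                        Dist = (p.X - q.X) * (p.X - q.X) + (p.Y - q.Y) * (p.Y - q.Y)
                    })
                    .OrderBy(c => c.Dist)
                    .Take(k)
                    .Select(c => c.Index)
                    .ToArray();
                result.Add(nearest);
            }
            return result;
        }
    }

The resulting point pairs would then be handed to the homography estimation (typically made robust with something like RANSAC), which is not shown here.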
If I'm not mistaken, the descriptors are oriented by the strongest angle in the angle histogram. That means you may also decide to take the Euclidean (L2) distance of the 64- or 128-dimensional feature descriptors directly to obtain the actual feature-space similarity of two given features, and perform homography on the best k candidates. (You will not compare the scale at which the descriptors were found though, because that would defeat the purpose of scale invariance.)
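For reference, comparing two descriptors this way is just an element-wise sum of squared differences. A minimal sketch, assuming the descriptors come out of OpenSurf as plain float[] arrays of equal length (the class and method names are illustrative):

    using System;

    public static class DescriptorDistance
    {
        // Squared Euclidean (L2) distance between two descriptors of equal length.
        public static double SquaredL2(float[] a, float[] b)
        {
            if (a.Length != b.Length)
                throw new ArgumentException("Descriptor lengths differ.");

            double sum = 0.0;
            for (int i = 0; i < a.Length; i++)
            {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return sum; // take Math.Sqrt(sum) for the true distance; the ordering is the same
        }
    }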
Both options are time consuming and directly depend on the number of images and features; in other words: stupid idea.
A neat trick is to not use actual distances at all, but approximate distances instead. In other words, you want an approximate nearest neighbor algorithm, and FLANN (although not for .NET) would be one of them.
One key point here is the projection search algorithm. It works like this:
Assume you want to compare the descriptors in 64-dimensional feature space. You generate a random 64-dimensional vector and normalize it, resulting in an arbitrary unit vector in feature space; let's call it A. Now (during indexing) you form the dot product of each descriptor with this vector. This projects each 64-dimensional descriptor onto A, resulting in a single real number a_n. (This value a_n represents the distance of the descriptor along A in relation to A's origin.)
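A minimal sketch of that indexing step, assuming 64-element float[] descriptors; the names CreateRandomAxis and Project are made up for illustration:

    using System;
    using System.Linq;

    public static class ProjectionIndexing
    {
        // Draw a random direction and normalize it to unit length. Gaussian
        // components make the direction uniformly distributed on the sphere.
        public static double[] CreateRandomAxis(int dimensions, Random rng)
        {
            var axis = new double[dimensions];
            for (int i = 0; i < dimensions; i++)
            {
                double u1 = 1.0 - rng.NextDouble();
                double u2 = rng.NextDouble();
                axis[i] = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
            }
            double norm = Math.Sqrt(axis.Sum(x => x * x));
            for (int i = 0; i < dimensions; i++)
                axis[i] /= norm;
            return axis;
        }

        // Dot product of a descriptor with the axis: the scalar a_n for that descriptor.
        public static double Project(float[] descriptor, double[] axis)
        {
            double dot = 0.0;
            for (int i = 0; i < descriptor.Length; i++)
                dot += descriptor[i] * axis[i];
            return dot;
        }
    }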
This image, borrowed from this answer on CrossValidated regarding PCA, demonstrates it visually; think of the rotation as the result of different random choices of A, where the red dots correspond to the projections (and thus the scalars a_n). The red lines show the error you make by using that approach; this is what makes the search approximate.
You will need A again for the search, so you store it. You also keep track of each projected value a_n and the descriptor it came from; furthermore, you align each a_n (with a link to its descriptor) in a list, sorted by a_n.
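In code, that bookkeeping could look roughly like this; ProjectedEntry, ImageId and DescriptorId are hypothetical names for "the descriptor it came from":

    using System.Collections.Generic;
    using System.Linq;

    // One entry of the index: a projected value a_n plus a link back to the
    // descriptor it came from.
    public class ProjectedEntry
    {
        public double Value;      // a_n
        public int ImageId;       // which database image the descriptor belongs to
        public int DescriptorId;  // which descriptor within that image
    }

    // One index consists of the stored axis A and the entries sorted by a_n.
    public class ProjectionIndex
    {
        public double[] Axis;
        public List<ProjectedEntry> Entries;

        public static ProjectionIndex Build(double[] axis, IEnumerable<ProjectedEntry> entries)
        {
            return new ProjectionIndex
            {
                Axis = axis,
                Entries = entries.OrderBy(e => e.Value).ToList()
            };
        }
    }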
To clarify using another image from here, we're interested in the location of the projected points along the axis A:
The values a_0 .. a_3 of the 4 projected points in the image are approximately sqrt(0.5²+1.5²)≈1.58, sqrt(0.4²+1.1²)≈1.17, -0.84 and -0.95, corresponding to their distance to A's origin.
If you now want to find similar images, you do the same: project each query descriptor onto A, resulting in a scalar q (query). Now you go to the position of q in the list and take the k surrounding entries. These are your approximate nearest neighbors. Now take the feature-space distance between these k candidates and the query and sort by lowest distance - the top ones are your best candidates.
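Continuing the sketch above (reusing the hypothetical ProjectionIndex, ProjectedEntry and ProjectionIndexing types), the query step might look like this; ApproximateNeighbours is an illustrative name:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class ProjectionSearch
    {
        // Project the query descriptor onto the stored axis, locate it in the
        // sorted list and return the k surrounding entries as approximate
        // nearest neighbours.
        public static List<ProjectedEntry> ApproximateNeighbours(
            ProjectionIndex index, float[] queryDescriptor, int k)
        {
            double q = ProjectionIndexing.Project(queryDescriptor, index.Axis);

            // Binary search for the insertion position of q among the a_n values.
            List<double> values = index.Entries.Select(e => e.Value).ToList();
            int pos = values.BinarySearch(q);
            if (pos < 0) pos = ~pos;

            // Take up to k entries around that position.
            int start = Math.Max(0, Math.Min(pos - k / 2, index.Entries.Count - k));
            int count = Math.Min(k, index.Entries.Count - start);
            return index.Entries.GetRange(start, count);
        }
    }

The exact feature-space distance (e.g. the SquaredL2 sketch from earlier) is then computed only for the descriptors behind those returned entries, and the results sorted ascending.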
Coming back to the last picture, assume the topmost point is our query. Its projection is 1.58 and its approximate nearest neighbor (of the four projected points) is the one at 1.17. They're not really close in feature space, but given that we just compared two 64-dimensional vectors using only two values, it's not that bad either.
You can see the limits there: similar projections do not at all require the original values to be close, and this will of course result in rather creative matches. To accommodate for this, you simply generate more base vectors B, C, etc. - say n of them - and keep track of a separate list for each. Take the k best matches from all of them, sort that list of k*n 64-dimensional vectors according to their Euclidean distance to the query vector, perform homography on the best ones and select the one with the lowest projection error.
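A sketch of that multi-axis candidate pooling, again on top of the hypothetical types from above; lookupDescriptor, which maps an entry back to its raw float[] descriptor, is assumed to be provided by whatever stores the database images:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class MultiAxisSearch
    {
        // Query n independent projection indices, pool the k candidates from
        // each, de-duplicate them and rank the pool by exact descriptor distance.
        public static List<ProjectedEntry> BestCandidates(
            IList<ProjectionIndex> indices,
            float[] queryDescriptor,
            int k,
            Func<ProjectedEntry, float[]> lookupDescriptor)
        {
            return indices
                .SelectMany(ix => ProjectionSearch.ApproximateNeighbours(ix, queryDescriptor, k))
                .GroupBy(e => new { e.ImageId, e.DescriptorId })
                .Select(g => g.First())
                .OrderBy(e => DescriptorDistance.SquaredL2(lookupDescriptor(e), queryDescriptor))
                .ToList();
        }
    }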
The neat part about this is that if you have n (random, normalized) projection axes and want to search in 64-dimensional space, you simply multiply each descriptor with an n x 64 matrix, resulting in n scalars.
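Concretely, stacking the n axes as the rows of a matrix reduces indexing one descriptor to a single matrix-vector product; a small sketch (ProjectAll is a made-up name):

    public static class MatrixProjection
    {
        // axes: n rows of length 64, one normalized projection axis per row.
        // Returns the n projected scalars for a single descriptor, i.e. the
        // product of the n x 64 axis matrix with the 64-dimensional descriptor.
        public static double[] ProjectAll(double[][] axes, float[] descriptor)
        {
            var result = new double[axes.Length];
            for (int row = 0; row < axes.Length; row++)
            {
                double dot = 0.0;
                for (int col = 0; col < descriptor.Length; col++)
                    dot += axes[row][col] * descriptor[col];
                result[row] = dot;
            }
            return result;
        }
    }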