OpenCV has the matchTemplate() function, which operates by sliding the template across the input image and generating an output array of match scores, one per template position.
Where can I learn more about how to interpret the six TemplateMatchModes?
I've read through and implemented code based on the tutorial, but other than understanding that one looks for minimum results for TM_SQDIFF for a match and maximums for the rest, I don't know how to interpret the different approaches, and the situations where one would choose one over another.
For example (taken from the tutorial)
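import cv2 as cv
import numpy as np
# img_gray and template are the grayscale image and template loaded earlier in the tutorial
res = cv.matchTemplate(img_gray, template, cv.TM_CCOEFF_NORMED)
threshold = 0.8
loc = np.where(res >= threshold)  # (rows, cols) of every score at or above the threshold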
and
$$R(x,y) = \frac{\sum_{x',y'} \left( T'(x',y') \cdot I'(x+x',y+y') \right)}{\sqrt{\sum_{x',y'} T'(x',y')^2 \cdot \sum_{x',y'} I'(x+x',y+y')^2}}$$
(taken from the doc page)
I would infer that TM_CCOEFF_NORMED would return values between 0 and 1, and that the 0.8 threshold is arbitrary, but that is just supposition.
Are there deeper dives into the equations online, measurements of performance against standard datasets, or academic papers about the different modes and when and why to use one over another?
Template matching is the process of moving the template over the entire image and calculating the similarity between the template and the covered window on the image. Template matching is implemented through two-dimensional convolution.
Template matching is an approach for signal pattern recognition, often used for biomedical signals including electroencephalogram (EEG).
Template matching works by "sliding" the template across the original image. As it slides, it compares or matches the template to the portion of the image directly under it. It does this matching by calculating a number. This number denotes the extent to which the template and the portion of the original are equal.
The difficulty with template matching as a model for perception is that contexts are rarely constrained. Templates are not inherently view invariant: for every different possible view, there would have to be a different template (replication). As such, template representations are uneconomical.
All of the template matching modes can be classified roughly as a dense (meaning pixel-wise) similarity metric, or equivalently but inversely, a distance metric between images.
Generally, you will have two images and you want to compare them in some way. Off the bat, template matching doesn't directly help you match things that are scaled, rotated, or warped. Template matching is strictly concerned with measuring the similarity of two images exactly as they appear. However, the actual metrics used here are used everywhere in computer vision, including finding transformations between images...just usually there's more complex steps going on in addition (like gradient descent to find the optimal transformation parameters).
There are many choices for distance metrics, and they generally have pros and cons depending on the application.
For a first start, the most basic distance metric is just the absolute difference between two values, i.e. d(x, y) = abs(x - y). For images, an easy way to extend this from single values is just to sum all of these distances, pixel-wise, leading to the sum of absolute differences (SAD) metric; it is also known as the Manhattan or the taxicab distance, and defines the L1 norm. Annoyingly, this isn't implemented as one of OpenCV's template matching modes, but it's still important in this discussion as a comparison to SSD.
In the template matching scenario, you slide a template along multiple places and simply find where the smallest difference occurs. It is the equivalent to asking what the index of the closest value to 5 is in the array [1, 4, 9]. You take the absolute difference of each value in the array with 5, and index 1 has the smallest difference, so that's the location of the closest match. Of course in template matching the value isn't 5 but an array, and the image is a larger array.
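The "closest value to 5" analogy and the sliding comparison can be sketched in a few lines of NumPy. This is only an illustration: the helper names closest_index and sad_match_1d are made up for this example, the signal is one-dimensional for brevity, and OpenCV itself does not offer a SAD mode.

```python
import numpy as np

def closest_index(values, target):
    """Index of the entry with the smallest absolute difference to target."""
    values = np.asarray(values, dtype=float)
    return int(np.argmin(np.abs(values - target)))

def sad_match_1d(signal, template):
    """Slide template along signal; return (best offset, SAD score at every offset)."""
    signal = np.asarray(signal, dtype=float)
    template = np.asarray(template, dtype=float)
    n_offsets = len(signal) - len(template) + 1
    scores = np.array([
        np.abs(signal[i:i + len(template)] - template).sum()
        for i in range(n_offsets)
    ])
    return int(np.argmin(scores)), scores

# The scalar case from the text: index 1 holds the value closest to 5.
print(closest_index([1, 4, 9], 5))  # 1

# The array case: the template is compared against every window of the signal.
best, scores = sad_match_1d([0, 1, 5, 6, 7, 2], [5, 6, 8])
print(best, scores.tolist())  # 2 [13.0, 7.0, 1.0, 8.0]
```

The smallest score marks the best match, which is the same "look for the minimum" rule that applies to TM_SQDIFF.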
TM_SQDIFF
An interesting feature of the SAD metric is that it doesn't penalize really big differences any more than a bunch of really small differences. Let's say we want to compute d(a, b) and d(a, c) with the following vectors:
a = [1, 2, 3]
b = [4, 5, 6]
c = [1, 2, 12]
Taking the sums of absolute differences element-wise, we see
SAD(a, b) = 3 + 3 + 3 = 9 = 0 + 0 + 9 = SAD(a, c)
In some applications, maybe that doesn't matter. But in other applications, you might want these two distances to actually be quite different. Squaring the differences, instead of taking their absolute value, penalizes values that are further from what you expect---it makes the images more distant as the difference in value grows. It maps more to how someone might explain an estimate as being way off, even if in value it's not actually that distant. The sum of square differences (SSD) is equivalent to the squared Euclidean distance, the distance function for the L2 norm. With SSD, we see our two distances are now quite different:
SSD(a, b) = 3^2 + 3^2 + 3^2 = 27 != 81 = 0^2 + 0^2 + 9^2 = SSD(a, c)
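Both comparisons can be reproduced with NumPy. The function names sad and ssd are just labels chosen for this sketch.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([1, 2, 12])

def sad(x, y):
    """Sum of absolute differences (L1 distance)."""
    return int(np.abs(x - y).sum())

def ssd(x, y):
    """Sum of squared differences (squared L2 distance)."""
    return int(((x - y) ** 2).sum())

print(sad(a, b), sad(a, c))  # 9 9   -> SAD cannot tell the two cases apart
print(ssd(a, b), ssd(a, c))  # 27 81 -> SSD penalizes the single large error
```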
You may see that the L1 norm is sometimes called a robust norm. This is specifically because a single point of error won't grow the distance more than the error itself. But of course with SSD, an outlier will make the distance much larger. So if your data is somewhat prone to a few values that are very distant, note that SSD is probably not a good similarity metric for you. A good example might be comparing images that may be overexposed. In some part of the image, you may just have white sky where the other is not white at all, and you'll get a massive distance between images from that.
Both SAD and SSD have a minimum distance of 0, when the two images compared are identical. They're both always non-negative since the absolute differences or square differences are always non-negative.
TM_CCORR
SAD and SSD are both generally discrete metrics---so they're a natural consideration for sampled signals, like images. Cross correlation however is applicable as well to continuous, and therefore analog, signals, which is part of its ubiquity in signal processing. With signals broadly, trying to detect the presence of a template inside a signal is known as a matched filter, and you can basically think of it as the continuous analog of template matching.
Cross correlation just multiplies the two images together element-wise and sums the products. You can imagine that if the two signals line up exactly, multiplying them together will simply square the template. If they're not lined up just-so, then the summed product will be smaller. So, the location where the product is maximized is where they line up the best. However, there is a problem with cross correlation in the case when you're using it as a similarity metric of signals you're not sure are related, and that is usually shown in the following example. Suppose you have three arrays:
a = [2, 600, 12]
b = [v, v, v]
c = [2v, 2v, 2v]
Broadly, there's no obvious correlation between a and b nor a and c. And generally, a shouldn't correlate any more to b than to c. But, it's a product, and thus ccorr(a, c) = 2*ccorr(a, b). So, that's not ideal for trying to find a template inside a larger image. And because we're dealing with discrete digital signals that have a defined maximum value (images), that means that a bright white patch of the image will basically always have the maximum correlation. Because of these issues, TM_CCORR is not particularly useful as a template matching method.
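A small numeric check of both problems, assuming v = 3 (any positive constant would do) and a made-up one-dimensional "image" that contains an exact copy of the template followed by a saturated white patch:

```python
import numpy as np

def ccorr(x, y):
    """Plain cross correlation of two equal-length patches: sum of products."""
    return float(np.dot(x, y))

a = np.array([2.0, 600.0, 12.0])
v = 3.0                      # arbitrary constant, chosen only for illustration
b = np.full(3, v)
c = np.full(3, 2 * v)
print(ccorr(a, b), ccorr(a, c))  # 1842.0 3684.0 -> doubling the values doubles the score

# The exact copy of the template sits at offset 0; offset 3 is pure white.
template = np.array([10.0, 200.0, 10.0])
signal = np.array([10.0, 200.0, 10.0, 255.0, 255.0, 255.0])
scores = np.array([ccorr(signal[i:i + 3], template) for i in range(4)])
print(scores.tolist())           # [40200.0, 6550.0, 53650.0, 56100.0]
print(int(np.argmax(scores)))    # 3 -> the white patch beats the exact match
```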
TM_CCOEFF
One simple way to solve the problem of correlating with bright patches is to simply subtract off the mean before comparing the signals. That way, signals that are simply shifted have the same correlation as those that are unshifted. And this makes sense with our intuition---signals that vary together are correlated.
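Here is a sketch of the mean-subtraction idea on a made-up one-dimensional signal, with an exact copy of the template at offset 0 and a white patch at offset 3. The helper name ccoeff is chosen for this example; it centers both patches before taking the sum of products, which is the idea behind TM_CCOEFF.

```python
import numpy as np

def ccoeff(x, y):
    """Cross correlation after subtracting each patch's own mean."""
    return float(np.dot(x - x.mean(), y - y.mean()))

template = np.array([10.0, 200.0, 10.0])
signal = np.array([10.0, 200.0, 10.0, 255.0, 255.0, 255.0])

scores = np.array([ccoeff(signal[i:i + 3], template) for i in range(4)])
print(int(np.argmax(scores)))  # 0 -> the exact match wins
print(scores[3])               # 0.0 -> a flat white patch no longer correlates

# Adding a constant brightness offset to a patch leaves its score unchanged.
print(np.isclose(ccoeff(template, template + 50.0), ccoeff(template, template)))  # True
```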
TM_SQDIFF_NORMED, TM_CCORR_NORMED, TM_CCOEFF_NORMED
All of the methods in OpenCV are normalized the same way. The point of normalization is not to give a confidence/probability, but to give a metric that you can compare across templates of different sizes or with values at different scales. For example, let's say we want to find if an object is in an image, and we have two different templates of this object. The two templates are different sizes. We could just normalize by the number of pixels, which would work to compare templates of different sizes. However, say my templates are actually quite different in intensities, like one has much higher variance of the pixel values than the other. Typically, what you'd do in this case is divide by the standard deviation (the square root of the mean squared difference from the mean). OpenCV effectively does this with the TM_CCOEFF_NORMED method, since the sum of squared differences from the mean is proportional to the variance, but the other methods aren't mean shifted, so the scaling is just a measure of the sum of the squared image values. Either way, the result is similar: you want to scale by something that relates to the intensity of the image patches used.
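As a rough sketch, the three normalized scores for a single template/window pair can be written directly from the formulas on the OpenCV doc page. The function names below mirror the mode names but are plain NumPy helpers written for this example, and returning 0.0 for a zero-variance patch is an assumption of the sketch rather than documented OpenCV behavior.

```python
import numpy as np

def _scale(t, w):
    """Shared denominator: sqrt(sum(t^2) * sum(w^2))."""
    return np.sqrt(np.sum(t ** 2) * np.sum(w ** 2))

def sqdiff_normed(t, w):
    return float(np.sum((t - w) ** 2) / _scale(t, w))

def ccorr_normed(t, w):
    return float(np.sum(t * w) / _scale(t, w))

def ccoeff_normed(t, w):
    t0, w0 = t - t.mean(), w - w.mean()   # mean shift first
    denom = _scale(t0, w0)
    # Assumption: treat a flat (zero-variance) patch as uncorrelated.
    return float(np.sum(t0 * w0) / denom) if denom > 0 else 0.0

t = np.array([10.0, 200.0, 10.0, 50.0])
brighter = 2.0 * t + 30.0      # same pattern with different contrast and brightness
inverted = 255.0 - t           # photographic negative of the pattern
flat = np.full(4, 255.0)       # saturated white patch

print(ccoeff_normed(t, brighter))  # close to 1: gain and offset are normalized away
print(ccoeff_normed(t, inverted))  # close to -1: perfectly anti-correlated
print(ccoeff_normed(t, flat))      # 0.0 under the zero-variance assumption
print(ccorr_normed(t, flat))       # clearly positive, although the patch has no structure
```

By the Cauchy-Schwarz inequality the ccoeff_normed score lies between -1 and 1, not between 0 and 1, so a threshold such as 0.8 sits on a scale where 0 means "no linear relationship" and negative values mean inverted contrast.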
There are other useful metrics that OpenCV does not provide. Matlab provides SAD, as well as the maximum absolute difference metric (MaxAD), which is also known as the uniform distance metric and gives the L∞ norm. Basically, you take the max absolute difference instead of the sum of them. Other metrics that are used are typically seen in optimization settings, for example the enhanced correlation coefficient which was first proposed for stereo matching, and then later expanded for alignment in general. That method is used in OpenCV, but not for template matching; you'll find the ECC metric in computeECC() and findTransformECC().
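For comparison with SAD and SSD, the maximum absolute difference is a one-liner in NumPy. The name max_abs_diff is invented for this sketch, and the vectors are the same a, b, c used in the SAD/SSD example.

```python
import numpy as np

def max_abs_diff(x, y):
    """Maximum absolute difference (L-infinity distance) between two patches."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.max(np.abs(x - y)))

a = [1, 2, 3]
b = [4, 5, 6]
c = [1, 2, 12]
print(max_abs_diff(a, b), max_abs_diff(a, c))  # 3.0 9.0
```

Only the single worst pixel matters here, so this metric is even more sensitive to one outlier than SSD.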
Most often, you will see normed and un-normed SSD (TM_SQDIFF_NORMED, TM_SQDIFF), and zero-normalized cross-correlation / ZNCC (TM_CCOEFF_NORMED) used. Sometimes you may see TM_CCORR_NORMED, but less often. According to some lecture notes I found online (some nice examples and intuition there on this topic!), Trucco and Verri's CV book states that generally SSD works better than correlation, but I don't have T&V's book to see why they suggest that; presumably the comparison is on real-world photographs. Either way, SAD and SSD are definitely useful, especially on digital images.
I don't know of any definitive examples of one or the other being inherently better in most cases or something---I think it really depends on your imagery and template. Generally I'd say: if you're looking for exact or very close to exact matches, use SSD. It is fast, and it definitely maps to what you're trying to minimize (the difference between the template and image patch). There's no need to normalize in that case, it is just added overhead. If you have similar requirements but need multiple templates to be comparable, then normalize the SSD. If you're looking for matches, but you're working with real-world photographs that may have exposure or contrast differences, the mean shifting and variance equalization from ZNCC will likely be the best.
As for picking the right threshold, the value from ZNCC or SSD is not a confidence or probability number at all. If you want to pick the right threshold, you can measure the parameter in any number of typical ways. You can calculate ROC curves or PR curves for different thresholds. You can use regression to find the optimal parameter. You'll need to label some data, but then at least you'll have measurements of how you're doing against some test set so that your choice is not arbitrary. As usual with a data-filled field, you'll need to make sure your data is as close to real world examples as possible, and that your test data covers your edge cases as well as your typical images.
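One way to make the threshold less arbitrary is sketched below: sweep candidate thresholds over a small labeled set and compute precision and recall for each. The scores and labels are invented purely for illustration (imagine the peak ZNCC score for eight candidate detections, labeled by hand), and picking the threshold with the best F1 score is just one possible criterion.

```python
import numpy as np

# Hypothetical labeled data: match score per candidate, 1 = real match, 0 = false alarm.
scores = np.array([0.95, 0.90, 0.85, 0.70, 0.65, 0.60, 0.40, 0.30])
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])

def precision_recall(scores, labels, threshold):
    """Precision and recall when everything scoring >= threshold is accepted."""
    predicted = scores >= threshold
    tp = int(np.sum(predicted & (labels == 1)))
    fp = int(np.sum(predicted & (labels == 0)))
    fn = int(np.sum(~predicted & (labels == 1)))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_f1_threshold(scores, labels):
    """Try every observed score as a threshold and keep the one with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):
        p, r = precision_recall(scores, labels, t)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

print(precision_recall(scores, labels, 0.8))  # (1.0, 0.75)
print(best_f1_threshold(scores, labels))      # threshold 0.65 gives the best F1 on this toy data
```

Plotting the (recall, precision) pairs over all thresholds gives the PR curve mentioned above; on real data the labeled set should be large enough to cover the edge cases.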