Playing around with finding stuff on a graphical screen, I'm currently at a loss about how to find a given shape within an image. The shape in the image could have a different scale and will be at some unknown x,y offset, of course.
Aside from pixel artifacts resulting from different scales, there is also a little noise in both images, so I need a somewhat tolerant search.
Here's the image I am looking for.
It should show up somewhere in a screen dump of my (dual) screen buffer, roughly 3300 x 1200 pixels in size. I'd of course expect to find it in a browser window, but that information shouldn't be necessary.
The object of this exercise (so far) is to come up with a result that says: (x1,y1) to (x2,y2).
I would like to be robust against scaling and the noise that's likely to be introduced by dithering. On the other hand, I can rule out some of the usual CV challenges, such as rotation or non-rigidity. That frame shape is dead easy for the human brain to discern; how hard can it be for a dedicated piece of software? This is an Adobe Flash application, and until recently I had thought that perceiving the images from a game GUI would be easy as pie.
I'm looking for an algorithm that can find the x,y translation at which the greatest overlap between the needle and the haystack occurs, preferably without having to iterate through a series of possible scale factors. Ideally, the algorithm could abstract out the "shape-ness" of the images in a way that's independent of scale.
I've read some interesting things about using Fourier transforms to accomplish something similar: given a target image at the same scale, an FFT and some matrix math yielded the points in the bigger image that corresponded to the search pattern. But I don't have the theoretical background to put this into practice, nor do I know whether this approach will gracefully handle the scale problem. Help would be appreciated!
Technology: I'm programming in Clojure/Java but could adapt algorithms from other languages. I think I should be able to interface with libraries that follow C calling conventions, but I would prefer a pure Java solution.
You may be able to understand why I've shied away from presenting the actual image. It's just a silly game, but the task of screen-reading it is proving much more challenging than I had thought.
I'm obviously able to do an exhaustive search of my screen buffer for the very pixels (excluding the black) that make up my image, and that even runs in under a minute. But my ambition was to find that wooden frame using a technique that would match the shape regardless of differences that might arise from scaling and dithering.
Dithering, in fact, is one of many frustrations I'm having with this project. I've been working on extracting some useful vectors by edge extraction, but edges are woefully elusive because the pixels of any given area have widely inconsistent colors - so it's hard to tell real edges from local dithering artifacts. I had no idea that such a simple-looking game would produce graphics that are so hard for software to perceive.
Should I start off by locally averaging pixels before I start looking for features? Should I reduce color depth by throwing out the least significant bits of the pixel color values?
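Concretely, I'm picturing something like this rough Java sketch of both ideas (the 3x3 window, the 4-bit quantization, and the method name are just arbitrary starting points of mine):

    import java.awt.image.BufferedImage;

    // Rough sketch: 3x3 box average followed by dropping the low 4 bits
    // of each channel, to suppress dithering noise before feature search.
    // Border pixels are left black for brevity.
    static BufferedImage smoothAndQuantize(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int r = 0, g = 0, b = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int rgb = src.getRGB(x + dx, y + dy);
                        r += (rgb >> 16) & 0xFF;
                        g += (rgb >> 8) & 0xFF;
                        b += rgb & 0xFF;
                    }
                }
                // Average over the 9 samples, then keep only the top 4 bits.
                int rgb = (((r / 9) & 0xF0) << 16) | (((g / 9) & 0xF0) << 8) | ((b / 9) & 0xF0);
                out.setRGB(x, y, rgb);
            }
        }
        return out;
    }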
I'm trying for a pure Java solution (I'm actually programming in a Clojure/Java mix), so I'm not wild about OpenCV (which installs .dll or .so files of C code). Please don't worry about my choice of language; the learning experience is much more interesting to me than performance.
Being a computer vision guy, I would normally point to feature extraction and matching (SIFT, SURF, LBP, etc.), but this is almost certainly overkill, since most of these methods offer more invariances (i.e., tolerances against transformations) than you actually require (e.g. against rotation, luminance change, ...). Also, using features would involve either OpenCV or lots of programming.
So here is my proposal for a simple solution - you judge whether it passes the smartness threshold:
It looks like the image you are looking for has some very distinct structures (the letters, the logos, etc). I would suggest that you do a pixel-to-pixel match for every possible translation, and for a number of different scales (I assume the range of scales is limited) - but only for a small distinctive patch of the image you are looking for (say, a square portion of the yellow text). This is much faster than matching the whole thing. If you want a fancy name for it: In image processing it's called template matching by correlation. The "template" is the thing you are looking for.
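Here is a minimal sketch of that scan in plain Java (since you want a pure Java solution). It uses the sum of absolute differences (SAD) as the score, a common cheap stand-in for correlation, and all names are mine. For multiple scales, you would rescale the patch and repeat:

    import java.awt.image.BufferedImage;

    // Rough sketch: exhaustive template matching with the sum of absolute
    // differences (SAD) as the score. Returns {x, y} of the best offset.
    static int[] bestMatch(BufferedImage screen, BufferedImage patch) {
        int sw = screen.getWidth(), sh = screen.getHeight();
        int pw = patch.getWidth(), ph = patch.getHeight();
        long bestScore = Long.MAX_VALUE;
        int bestX = -1, bestY = -1;
        for (int oy = 0; oy <= sh - ph; oy++) {
            for (int ox = 0; ox <= sw - pw; ox++) {
                long score = 0;
                // Abandon a candidate as soon as it exceeds the best so far.
                for (int y = 0; y < ph && score < bestScore; y++) {
                    for (int x = 0; x < pw; x++) {
                        int a = screen.getRGB(ox + x, oy + y);
                        int b = patch.getRGB(x, y);
                        // Sum per-channel absolute differences.
                        score += Math.abs(((a >> 16) & 0xFF) - ((b >> 16) & 0xFF))
                               + Math.abs(((a >> 8) & 0xFF) - ((b >> 8) & 0xFF))
                               + Math.abs((a & 0xFF) - (b & 0xFF));
                    }
                }
                if (score < bestScore) { bestScore = score; bestX = ox; bestY = oy; }
            }
        }
        return new int[] { bestX, bestY };
    }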
Once you have found a few candidate locations for your small distinctive patch, you can verify that you have a hit by testing either the whole image or, more efficiently, a few other distinctive patches of the image (using, of course, the translation / scale you found). This makes your search robust against accidental matches of the original patch without stealing too much performance.
Regarding dithering tolerance, I would go for simple pre-filtering of both images (the template you are looking for, and the image that is your search space). Depending on the properties of the dithering, you can start experimenting with a simple box blur, and probably proceed to a median filter with a small kernel (3 x 3) if that does not work. This will not get you 100% identity between template and searched image, but robust numerical scores you can compare.
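For reference, a 3x3 per-channel median filter is only a few lines in plain Java. A rough sketch (assuming TYPE_INT_RGB images; border pixels are left black for brevity):

    import java.awt.image.BufferedImage;
    import java.util.Arrays;

    // Sketch of a 3x3 per-channel median filter, one option for knocking
    // down dithering artifacts before matching.
    static BufferedImage median3x3(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        int[] r = new int[9], g = new int[9], b = new int[9];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int i = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++, i++) {
                        int rgb = src.getRGB(x + dx, y + dy);
                        r[i] = (rgb >> 16) & 0xFF;
                        g[i] = (rgb >> 8) & 0xFF;
                        b[i] = rgb & 0xFF;
                    }
                }
                Arrays.sort(r); Arrays.sort(g); Arrays.sort(b);
                // Element 4 is the median of the nine neighborhood values.
                out.setRGB(x, y, (r[4] << 16) | (g[4] << 8) | b[4]);
            }
        }
        return out;
    }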
Edit in the light of comments
I understand that (1) you want something more robust, more "CV-like", and a bit fancier as a solution, and that (2) you are skeptical about achieving scale invariance by simply scanning through a large stack of different scales.
Regarding (1), the canonical approach is, as mentioned above, to use feature descriptors. Feature descriptors do not describe a complete image (or shape), but a small portion of an image in a way that is invariant against various transformations. Have a look at SIFT and SURF, and at VLFeat, which has a good SIFT implementation and also implements MSER and HOG (and is much smaller than OpenCV). SURF is easier to implement than SIFT; both are heavily patented. Both have an "upright" version, which has no rotation invariance. This should increase robustness in your case.
The strategy you describe in your comment goes more in the direction of shape descriptors than image feature descriptors. Make sure that you understand the difference between the two! 2D shape descriptors aim at shapes that are typically described by an outline or a binary mask. Image feature descriptors (in the sense used above) aim at images with intensity values, typically photographs. An interesting shape descriptor is shape context; many others are summarized here. I don't think your problem is best solved by shape descriptors, but maybe I misunderstood something. I would be very careful with shape descriptors on image edges, as edges, being first derivatives, can be strongly altered by dithering noise.
Regarding (2): I'd like to convince you that scanning through a bunch of different scales is not just a stupid hack for those who don't know computer vision! Actually, it's done a lot in vision; we just have a fancy name for it to mislead the uninitiated: scale space search. That's a bit of an oversimplification, but really just a bit. Most image feature descriptors that are used in practice achieve scale invariance using a scale space, which is a stack of increasingly downscaled (and low-pass filtered) images. The only trick they add is to look for extrema in the scale space and compute descriptors only at those extrema. But still, the complete scale space is computed and traversed to find those extrema. Have a look at the original SIFT paper for a good explanation of this.
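If you want to try the scale-stack idea, building such a stack is cheap with plain Java2D. A sketch, where the scale factor and level count are arbitrary parameters you would tune, and bilinear resampling stands in for a proper low-pass filter:

    import java.awt.Graphics2D;
    import java.awt.RenderingHints;
    import java.awt.image.BufferedImage;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a simple scale stack: repeatedly downscale the screen grab
    // by a fixed factor (< 1.0), so the template can be matched at a single
    // size against every level.
    static List<BufferedImage> buildPyramid(BufferedImage src, double factor, int levels) {
        List<BufferedImage> pyramid = new ArrayList<>();
        pyramid.add(src);
        BufferedImage current = src;
        for (int i = 1; i < levels; i++) {
            int w = (int) Math.round(current.getWidth() * factor);
            int h = (int) Math.round(current.getHeight() * factor);
            if (w < 1 || h < 1) break;
            BufferedImage next = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = next.createGraphics();
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(current, 0, 0, w, h, null);
            g.dispose();
            pyramid.add(next);
            current = next;
        }
        return pyramid;
    }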
Nice. I once implemented a cheat for a Flash game by capturing the screen as well :). If you need to find the exact border you gave in the image, you could create a color filter that removes everything else, leaving you with a binary image for further processing (the task at hand would then be to find a matching rectangle with a certain border ratio). You could also implement four kernels that would find the corners at a few different scales.
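Such a color filter is just a per-pixel tolerance test. A rough Java sketch, where the target color and tolerance are placeholders you would pick from the actual frame color:

    import java.awt.image.BufferedImage;

    // Sketch of a color filter: mark pixels whose color lies within a
    // per-channel tolerance of the target, yielding a binary mask.
    static boolean[][] colorMask(BufferedImage src, int targetRgb, int tol) {
        int w = src.getWidth(), h = src.getHeight();
        boolean[][] mask = new boolean[h][w];
        int tr = (targetRgb >> 16) & 0xFF, tg = (targetRgb >> 8) & 0xFF, tb = targetRgb & 0xFF;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int dr = Math.abs(((rgb >> 16) & 0xFF) - tr);
                int dg = Math.abs(((rgb >> 8) & 0xFF) - tg);
                int db = Math.abs((rgb & 0xFF) - tb);
                mask[y][x] = dr <= tol && dg <= tol && db <= tol;
            }
        }
        return mask;
    }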
If you have an image stream and know there is movement, you can also monitor the difference between frames to capture the action parts of your screen by employing a background modeling solution. Combine these and you will get quite far, I guess, without resorting to more exotic methods like multi-scale analysis.
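The frame-difference part can be equally simple; a sketch, with a made-up threshold as a starting point:

    import java.awt.image.BufferedImage;

    // Crude frame differencing as a poor man's background model: flag
    // pixels whose summed per-channel change exceeds the threshold.
    static boolean[][] motionMask(BufferedImage prev, BufferedImage curr, int threshold) {
        int w = curr.getWidth(), h = curr.getHeight();
        boolean[][] moving = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int a = prev.getRGB(x, y), b = curr.getRGB(x, y);
                int diff = Math.abs(((a >> 16) & 0xFF) - ((b >> 16) & 0xFF))
                         + Math.abs(((a >> 8) & 0xFF) - ((b >> 8) & 0xFF))
                         + Math.abs((a & 0xFF) - (b & 0xFF));
                moving[y][x] = diff > threshold;
            }
        }
        return moving;
    }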
Is performance an issue? My cheat ran at about 20 fps, as it needed to click a ball fast enough.