Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Suggest an algorithm for color pattern matching against a large known set

I have a requirement that calls for matching a Sample Set of color values against a Known Set of values to find either an exact match, or matches that are within an acceptable distance. I'm not entirely sure what algorithm would be best suited for this and I'm looking for suggestions.

I thought about using a SQL query as I think this would be a straightforward approach, however, ideally this would be done in-memory on the application server or even on a GPU for maximum speed.

Example:

Let's say we are given a set of three RGB color values, two blues and an orange:

Sample Set:

Color 1: 81,177,206 (blue)

Color 2: 36, 70, 224 (blue)

Color 3: 255, 132, 0 (orange)

This set of 3 color values must be matched against a much larger set of color values to see if this set exists within it, either with the same exact RGB values for each of the 3 colors - or - if any pattern exists where an RGB value of the colors varies by an acceptable degree. Let's assume any of the RGB components can be up to 3 digits higher or lower in value.

Let's say our big set of known color values that we'll search against looks like this:

Known Set:

            Color 1          Color 2       Color 3
Sample A: [25, 25, 25],    [10, 10, 10], [100, 100, 100] 

Sample B: [125, 125, 125], [10, 10, 10], [200, 200, 200] 

Sample C: [13, 87, 255],   [10, 10, 10], [100, 100, 100] 

Sample D: [67, 111, 0],    [10, 10, 10], [200, 200, 200] 

Sample E: [255, 255, 255], [10, 10, 10], [100, 100, 100] 

Given this scenario, we would find zero matches when we run our sample set against it, because none of the known colors have a Color 1 that is anywhere close to our sample set values. However, let's add another color to the Known Set that would return a positive match:

Sample F: [81,177,206], [36, 70, 224], [255, 132, 0]

If Sample F existed with these values in the Known Set, we would get a positive hit, because it's the exact RGB values as Color 1 in our Sample Set. Also, we need to accept a varying degree of differences in the RGB values, so the following would also return positive hits because each RGB value is within 3 digits from Color 1's values from the Sample Set:

Positive hits: (remember Color 1 is: 81,177,206)

Sample F: 80,177,206 (red channel is 1 digit away)

Sample F: 81,175,204 (green and blue channels 2 digits away)

Sample F: 82,179,208 (all three channels within 3 digits away)

However, if the distance is too great, then a match would not be found. Any RGB component must be within 3 digits to trigger a positive result. So if Sample F looked like the following, we would not get a positive result because the distance is too great:

Negative hits:

Sample F: 85,177,206 (red channel is 4 digits away)

Sample F: 81,170,206 (green channel is 7 digits away)

Sample F: 81,177,200 (blue channel is 6 digits away)

So far we've just taken Color 1 from the Sample Set into consideration. However the requirement calls for taking the entire Sample Set into consideration. So if no positive matches can be found for Color 1, then we assume no match at all and don't consider Colors 2 and 3 from the Sample Set.

However, if we find a positive result for Color 1, let's say 80,177,206 which is just 1 digit off in the Red channel 80 vs 81, then we do continue processing Color 2, and if we find a positive match for that then we process Color 3 and so on.

What are your suggestions for the algorithm best suited for this problem? I need something that will allow the Known Set to scale very large without too much of a performance hit. There will probably be 1M+ samples in the Known Set at scale.

I thought about using hashtables, one per Color to construct the Known Set. So I could test for a match on Color 1, and if found, test the hashtable for Color 2, and stop when I find no more hits. If I got through all 3 colors/hashtables with positive hits, then I'd have an overall positive match, else I would not. However, this approach doesn't allow for the variance needed in each of the RGB channels for each color. There would be too many combinations to allow for constructing hashtables to hold it all.

Thanks in advance and thanks for reading this far down!

like image 898
znelson Avatar asked Oct 19 '22 14:10

znelson


2 Answers

Keep one sorted list. Sort it three times with a stable sort, first by B, then by G, then by R. This leaves it sorted in RGB order. Given your input, find the indexes for the first and last acceptable R's with a binary search. Then search that range for acceptable G's, then search the again reduced range for B's. The whole thing should be O(lgN).

-- Unless I am missing something this solution generalizes to matching a set of 3 colors, or 10 colors or k colors. Generate an array of indexes into your list of color sets. To prepare, stable sort the indexes 3*k times as above. To seek, do 3*k binary searches in the opposite order.

(This assumes the colors are fixed into columns. If they are not, you can still use this method, but your index list goes to N*k size: it needs an entry for A1, A2, A3. And at the end add a check that you have matched one from each column.)

like image 63
AShelly Avatar answered Oct 22 '22 03:10

AShelly


In the end, after experimenting with SQL and GPU programming (Cudafy), the fastest, easiest, and most debuggable solution was to simply iterate through the data using Parallel.For(). This approach yielded 1.5M samples processed (90M total bytes) in 18ms.

like image 43
znelson Avatar answered Oct 22 '22 04:10

znelson