I am using the MinHash algorithm to find similar images. I ran across this post, How can I recognize slightly modified images?, which pointed me to the MinHash algorithm. I was using a C# implementation from this blog post, Set Similarity and Min Hash.
But while trying to use the implementation, I ran into 2 problems.

1. What should I set the universe value to? Since I read each image into a HashSet<byte>, the set only contains distinct byte values, thus comparing at most 256 distinct values (0 ~ 255). What is this universe in MinHash?

2. And what can I do to improve the C# MinHash implementation? Since a HashSet<byte> can contain at most 256 distinct values, the similarity value always comes out as 1.
Here is the source that uses the C# MinHash implementation from Set Similarity and Min Hash:
class Program
{
    static void Main(string[] args)
    {
        var imageSet1 = GetImageByte(@".\Images\01.JPG");
        var imageSet2 = GetImageByte(@".\Images\02.TIF");

        //var app = new MinHash(256);
        var app = new MinHash(Math.Min(imageSet1.Count, imageSet2.Count));
        double imageSimilarity = app.Similarity(imageSet1, imageSet2);
        Console.WriteLine("similarity = {0}", imageSimilarity);
    }

    private static HashSet<byte> GetImageByte(string imagePath)
    {
        using (var fs = new FileStream(imagePath, FileMode.Open, FileAccess.Read))
        using (var br = new BinaryReader(fs))
        {
            //List<int> bytes = br.ReadBytes((int)fs.Length).Cast<int>().ToList();
            var bytes = new List<byte>(br.ReadBytes((int)fs.Length).ToArray());
            return new HashSet<byte>(bytes);
        }
    }
}
Taking your second question first:
And what can I do to improve the C# MinHash implementation?
You are trying to compare images at the byte level for files that are inherently structured very differently (one image is a JPEG, the other a TIFF). Even if these files were visually identical, your implementation would never find the duplicates unless the files were of the same type.
That said, your MinHash implementation should depend on comparable attributes of the images, which you hash in order to create signatures. While the values of the bytes are certainly attributes of the image, they can't be compared to each other if the images are in different formats.
For images, you could use, for example, the RGB (and possibly alpha) values for each pixel in the image. These values are comparable no matter what format the image is in (you could use CMYK, or any other color space you want).
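As an illustration (this is not part of the referenced implementation, and PixelReader/GetPixels are hypothetical names), here is a minimal sketch of extracting comparable per-pixel values with System.Drawing, which decodes JPEG, TIFF, PNG, etc. into the same in-memory representation:

// Sketch: decode an image into a sequence of comparable pixel values,
// independent of the on-disk file format. Assumes System.Drawing is
// available (built into .NET Framework; the System.Drawing.Common
// package on newer .NET).
using System.Collections.Generic;
using System.Drawing;

static class PixelReader
{
    public static List<int> GetPixels(string imagePath)
    {
        var pixels = new List<int>();
        using (var bmp = new Bitmap(imagePath))
        {
            for (int y = 0; y < bmp.Height; y++)
            {
                for (int x = 0; x < bmp.Width; x++)
                {
                    // ToArgb packs alpha, red, green and blue into one int,
                    // giving the same value regardless of source format.
                    pixels.Add(bmp.GetPixel(x, y).ToArgb());
                }
            }
        }
        return pixels;
    }
}

(GetPixel is slow for large images; LockBits is the usual faster alternative, but it would make the sketch longer.)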
However, using the individual values for each pixel will give you poor results. The Jaccard similarity is used to compare the values from each set (regardless of whether or not you hash anything), and because sets have no order, two images that contain the same number of pixels of each color, but arranged in different positions, will produce a false positive.
Take for example two images that are both 100px x 100px, half red pixels and half green pixels, but with the colors arranged in opposite positions (red first in one, green first in the other).

Using the Jaccard similarity to compare the two, you'd get the following. (Note that since all pixels of one color have the same value, each set contains only one element per color. You could instead use the Jaccard bag comparison, which counts multiple occurrences of the same item, but in this case the value turns out the same.)
Legend:
g = green
r = red

left image  = { r, g }
right image = { r, g }

similarity = intersection(left, right) / union(left, right)
similarity = 2 / 2 = 100%
A note about the representation right image = { r, g }: because sets are unordered, { r, g } is the same as { g, r }. The two sets are literally identical, so it is obvious, even before the Jaccard similarity is calculated, that they will compare as equal.
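In C#, this set comparison can be written directly with HashSet operations; a minimal, generic sketch (the Jaccard helper name is mine):

// Sketch: Jaccard similarity |A ∩ B| / |A ∪ B| of two sets.
static double Jaccard<T>(HashSet<T> a, HashSet<T> b)
{
    var intersection = new HashSet<T>(a);
    intersection.IntersectWith(b);   // a ∩ b

    var union = new HashSet<T>(a);
    union.UnionWith(b);              // a ∪ b

    return union.Count == 0 ? 0.0 : (double)intersection.Count / union.Count;
}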
But obviously, these images are not the same.
This is why shingling is usually employed, in order to find distinct mini-regions within the set that can be used collectively to uniquely identify an item.
For images, you can use consecutive RGB values of a fixed length to generate shingles (in this case, reading pixels left-to-right, top-to-bottom, and wrapping around when an edge is hit). Assuming a shingle length of three, your sets look like this (note that I'm using square brackets to indicate attributes/vectors, as the shingles are not sets themselves):
left image = { [r, r, r], [r, r, g], [r, g, g], [g, g, g] }
right image = { [g, g, g], [g, g, r], [g, r, r], [r, r, r] }
And this gives you a Jaccard similarity of:

intersection(left, right) = 2
union(left, right) = 6
similarity(left, right) = 2 / 6 = 33.33%
This is a much closer estimate of how similar these images are (namely, not very) than the original, unshingled comparison.
Note that shingles can be of any length you choose. You'll have to decide which shingle length (and which similarity threshold) produces Jaccard results that best answer the question "how similar are these?"
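To make the idea concrete, here is a minimal sketch of generating fixed-length shingles from the flattened pixel sequence (the Shingles helper is a hypothetical name; encoding each shingle as a string is just a convenient way to get value equality inside a HashSet):

// Sketch: slide a window of shingleLength over the left-to-right,
// top-to-bottom pixel sequence; each window becomes one shingle.
static HashSet<string> Shingles(List<int> pixels, int shingleLength)
{
    var shingles = new HashSet<string>();
    for (int i = 0; i + shingleLength <= pixels.Count; i++)
    {
        shingles.Add(string.Join(",", pixels.GetRange(i, shingleLength)));
    }
    return shingles;
}

Combined with the Jaccard helper above, the comparison becomes Jaccard(Shingles(pixels1, 3), Shingles(pixels2, 3)).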
Now, answering your first question:
what is the universe value?
In this particular case, it's the number of items that can possibly exist in the universe. If you were using single RGB pixels, each channel can take 256 values, so the universe would be:

256 * 256 * 256 = 16,777,216
With shingling, the value is much, much higher, as you're dealing with combinations of those items. Ideally, you would generate a set of perfect hash functions to use for minhashing. However, because of the limitations of type systems (or because you don't want to store very large numbers in another storage medium), your focus should be on hash functions that minimize collisions. Knowing the number of possible items in the universe helps you construct such functions. In the implementation you reference, this universe size is used to generate random numbers, which are then used to build the multiple hash functions for minhashing, ideally producing minimal collisions.
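As an illustration of that idea (this is not the code from the referenced blog post), a common family for minhashing is h(x) = (a * x + b) mod p, with p a prime at least as large as the universe and a, b chosen at random per hash function:

// Sketch: a family of random linear hash functions
// h(x) = (a*x + b) mod p, where p is a prime >= the universe size.
// Here p = 2^31 - 1 (a known prime), which comfortably covers a
// 256^3 pixel universe; long arithmetic avoids overflow.
using System;

class HashFamily
{
    private const long P = 2147483647; // the Mersenne prime 2^31 - 1
    private readonly long[] a;
    private readonly long[] b;

    public HashFamily(int numHashes, Random rng)
    {
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++)
        {
            a[i] = 1 + (long)(rng.NextDouble() * (P - 1)); // a in [1, P-1]
            b[i] = (long)(rng.NextDouble() * P);           // b in [0, P-1]
        }
    }

    public long Hash(int i, long x) => (a[i] * x + b[i]) % P;
}

A MinHash signature then keeps, for each hash function i, the minimum of Hash(i, shingleId) over all shingles in a set; two signatures are compared position by position to estimate the Jaccard similarity.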