I've seen a few questions here related to determining the similarity of files, but they are all linked to a particular domain (images, sounds, text, etc). The techniques offered as solutions require knowledge of the underlying file format of the files being compared. What I am looking for is a method without this requirement, where arbitrary binary files could be compared without needing to understand what type of data they contain. That is, I am looking to determine the similarity percentage of two files' binary data.
To give a little more detail for you to work with, even though this is potentially applicable to many things, I do have a specific problem that I'm working on. I also currently have a working solution, but I don't think that it is ideal. There are probably many possible optimizations, both in the comparison method and in how the results are stored. Hopefully some people here will be able to give me some new ideas. I will probably edit in some information about my current method after a couple of days, but I don't want to bias people's thoughts about the problem by telling you how I'm already doing it.
The problem I'm working on is clone detection for video game ROM images. For those that don't have experience with emulation, ROMs are dumps of the data on game cartridges. A ROM "clone" is typically a modified version of the same game, the most common type being a translated version. For example, the Japanese and English versions of the original Final Fantasy for the NES are clones. The games share almost all of their assets (sprites, music, etc), but the text has been translated.
There are currently several groups that work on maintaining lists of clones for the various systems, but as far as I can tell, this is all done manually. What I am attempting to do is find a method to detect similar ROM images automatically and objectively, based on data similarity instead of "these seem like the same game". There are several reasons for detecting clones, but one of the major motivations is to be used with Solid compression. This allows compression of all game clones together into the same archive, with the entire compressed clone set often taking up only slightly more space than one of the individual ROMs.
Some concerns to consider when coming up with potential approaches:
It's been an interesting problem to work on, I look forward to seeing what other people can come up with. Let me know in the comments if you want any more details, and I'll try to supply them.
To calculate the similarity between two examples, you need to combine all the feature data for those two examples into a single numeric value. For instance, consider a shoe data set with only one feature: shoe size. You can quantify how similar two shoes are by calculating the difference between their sizes.
Two common measures for binary similarity data are a binary version of the inner (dot) product, in which equal weight is given to matches and nonmatches (this is the usual default for binary similarity data), and simple matching, which is the ratio of matches to the total number of values.
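To make that concrete, here is a minimal sketch (not from the original post) of the simple matching coefficient computed bit-by-bit over two byte strings; zero-padding the shorter input is an assumption made purely for illustration:

```python
def simple_matching(a: bytes, b: bytes) -> float:
    """Fraction of bit positions at which the two inputs agree."""
    length = max(len(a), len(b))
    a = a.ljust(length, b"\x00")  # illustrative choice: zero-pad the shorter input
    b = b.ljust(length, b"\x00")
    matches = 0
    for x, y in zip(a, b):
        # XOR exposes differing bits; the remaining bits of the byte match.
        matches += 8 - bin(x ^ y).count("1")
    return matches / (length * 8) if length else 1.0
```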
Typically, the Jaccard similarity coefficient (or index) is used to compare the similarity between two sets. For two sets A and B, the Jaccard index is defined as the ratio of the size of their intersection to the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|
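A rough sketch of how the Jaccard index could be applied to arbitrary binary files, treating each file as a set of overlapping byte n-grams (the n-gram length of 4 is an arbitrary choice for illustration, not something from the original answer):

```python
def ngrams(data: bytes, n: int = 4) -> set:
    """All overlapping n-byte substrings of the input."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def jaccard(path_a: str, path_b: str, n: int = 4) -> float:
    """Jaccard index of the two files' byte n-gram sets."""
    with open(path_a, "rb") as f:
        a = ngrams(f.read(), n)
    with open(path_b, "rb") as f:
        b = ngrams(f.read(), n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```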
It sounds like you want a binary delta, or perhaps an index derived from applying a binary delta (such as its size). You could then compare this index to some baseline that you determine experimentally to decide whether the two files are "clones" or not.
There are a lot of similarities between compression and delta creation, so I'd say you aren't far off with your current implementation.
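One way to make "delta size as an index" concrete without a dedicated delta tool is the normalized compression distance; the sketch below uses zlib only as a convenient stand-in, and is not something the answer itself proposes:

```python
import zlib

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: near 0 for very similar inputs,
    near 1 for unrelated ones. Note that zlib's small window limits how
    well this works on multi-megabyte ROMs; a stronger compressor (or a
    real delta encoder) would give a better signal."""
    ca = len(zlib.compress(a))
    cb = len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)
```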
That being said, pairwise comparison of every binary file in your database is probably prohibitively expensive (O(n²), I think). I would try to find a simple hash for identifying possible candidates for comparison, something conceptually similar to what spdenne and Eduard are suggesting. That is, find a hash which can be applied to every item once, sort that list, and then use a finer-grained comparison on items whose hashes are close together in the list.
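A sketch of that "hash once, sort, compare neighbours" idea; the signature used here (a file-size bucket plus a few sampled bytes) is a made-up placeholder, not a recommendation from the answer:

```python
def signature(data: bytes) -> tuple:
    """Cheap per-ROM fingerprint: size bucket plus a handful of sampled bytes."""
    step = max(1, len(data) // 16)
    return (len(data) // (64 * 1024), data[::step][:16])

def candidate_pairs(roms: dict) -> list:
    """roms maps name -> raw bytes; returns adjacent pairs in signature order,
    which are the only pairs handed to the expensive fine-grained comparison."""
    ordered = sorted(roms, key=lambda name: signature(roms[name]))
    return list(zip(ordered, ordered[1:]))
```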
Constructing hashes useful for the general case has been an actively pursued research topic in CS for several years. The LSHKit software library implements some algorithms of this sort. The internet-accessible paper Finding Similar Files in a Large File System seems targeted more at comparing text files, but might be useful to you. The more recent paper Multi-resolution similarity hashing describes a more powerful algorithm; it doesn't appear to be accessible without a subscription, though. You probably want to keep the Wikipedia article on Locality-Sensitive Hashing handy as you browse the other resources; they all get pretty technical, and the Wikipedia entry itself is fairly math-heavy. As a more user-friendly alternative, you might be able to apply some ideas (or even executables) from the field of Acoustic Fingerprinting.
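For a flavour of what the locality-sensitive hashing literature is doing, here is a bare-bones MinHash signature over byte 4-grams; a real LSH scheme would additionally band these signatures into buckets, and the hashing choice below (blake2b with a per-row salt) is just illustrative:

```python
import hashlib

def minhash(data: bytes, num_hashes: int = 64, n: int = 4) -> list:
    """MinHash signature: for each of num_hashes hash functions, keep the
    minimum hash value seen over all byte n-grams of the input."""
    sig = [float("inf")] * num_hashes
    for i in range(len(data) - n + 1):
        gram = data[i:i + n]
        for k in range(num_hashes):
            h = hashlib.blake2b(gram, digest_size=8, salt=k.to_bytes(8, "little"))
            v = int.from_bytes(h.digest(), "little")
            if v < sig[k]:
                sig[k] = v
    return sig

def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```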
If you're willing to abandon the general case, it's likely that you can find a much simpler (and faster) domain-specific hash function that works just for your ROMs. Possibly something involving the placement of standard, or common, byte sequences and the value of selected bits near them. I don't really know much about your binary format, but I'm imagining things that signal the start of sections in the file, like regions for sound, images, or text. Binary formats frequently store the addresses of these sorts of sections near the beginning of the file. Some also use a chaining mechanism that stores the address of the first section at a known location along with its size. This allows you to move to the next section, which also contains a size, and so on. A little investigation will probably allow you to discover any relevant formatting, if you're not already aware of it, and should put you well on your way to constructing a useful hash.
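Purely as an illustration of that kind of domain-specific fingerprint, here is a sketch that hashes only a fixed header region; the offsets are hypothetical and do not correspond to any real ROM format:

```python
import hashlib

# Hypothetical offsets, for illustration only -- not a real ROM layout.
SECTION_TABLE_OFFSET = 0x10
SECTION_TABLE_LENGTH = 0x40

def rom_fingerprint(data: bytes) -> str:
    """Hash only the (hypothetical) section table, which is more likely to
    survive a translation than the text or graphics data it points to."""
    table = data[SECTION_TABLE_OFFSET:SECTION_TABLE_OFFSET + SECTION_TABLE_LENGTH]
    return hashlib.sha1(table).hexdigest()
```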
If the hash functions don't get you all the way (or they require input of some sort to define a metric/distance) then there are several binary delta algorithms and implementations available on the web. The one I'm most familiar with is used by the subversion version control system. It uses a binary delta algorithm called xdelta to efficiently store binary file revisions. Here's a link directly to the file in their repository that implements it: xdelta.c. There's probably a tool on the web that makes this more accessible as well.
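If you go the delta route, here is a sketch of using the delta size as a similarity index, assuming the xdelta3 command-line tool is installed (the -e/-s encode flags are its usual invocation, but check your local version):

```python
import os
import subprocess
import tempfile

def delta_ratio(source: str, target: str) -> float:
    """Size of the xdelta3 delta relative to the target; small values
    suggest the two files are clones."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        delta_path = tmp.name
    try:
        subprocess.run(["xdelta3", "-e", "-f", "-s", source, target, delta_path],
                       check=True)
        return os.path.getsize(delta_path) / os.path.getsize(target)
    finally:
        os.remove(delta_path)
```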
You might want to look at bsdiff, which is a binary diffing/patching system. There is also a thesis with lots of theory.