I am thinking of building an API that would let a program submit a "fingerprint" of an academic publication, match this against a database of articles from Open Access journals, and if found, send the user the canonical citation information. Initially this would be for a specific small research field, so it wouldn't necessarily need to deal with 20 million papers to be successful (even if the 1000 most commonly cited papers in the field were covered, that would be a huge boon for productivity and collaboration).
Which library (ideally one that can interface with Ruby) would be best for this "fingerprinting"? I've seen Lucene's fuzzy match, but that seems to work at the word level, whereas in this case we would probably want to submit a much larger subset of the document. The reason for fuzzy matching is that some people might have a Word .doc preprint, some might have the final PDF, etc.
I really appreciate some of the ideas here. Googling for "perceptual hash" got me into a bunch of new material. I tried to summarize many of my findings here.
It seems like SimHash (for example, the C implementation) would be the way to go, but I still need to experiment more.
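To make the SimHash idea concrete, here is a minimal pure-Ruby sketch of how such a fingerprint could be computed and compared. It is not the C implementation mentioned above. The module name SimHashSketch, the three-word shingles, and the 64-bit MD5-derived hash are all assumptions chosen for illustration.

```ruby
require 'digest'

# Illustrative SimHash sketch. Module and method names are hypothetical,
# not taken from any existing library.
module SimHashSketch
  BITS = 64

  # Normalise the text and split it into overlapping word n-grams (shingles).
  def self.shingles(text, size = 3)
    words = text.downcase.scan(/[a-z0-9]+/)
    return [words.join(' ')] if words.length <= size
    words.each_cons(size).map { |gram| gram.join(' ') }
  end

  # Build a 64-bit fingerprint: every shingle votes +1 or -1 on each bit.
  def self.fingerprint(text)
    votes = Array.new(BITS, 0)
    shingles(text).each do |shingle|
      # The first 16 hex characters of the MD5 digest give a 64-bit integer.
      hash = Digest::MD5.hexdigest(shingle)[0, 16].to_i(16)
      BITS.times { |i| votes[i] += hash[i] == 1 ? 1 : -1 }
    end
    # A bit is set in the fingerprint when its votes sum to a positive number.
    votes.each_with_index.reduce(0) do |acc, (vote, i)|
      vote > 0 ? acc | (1 << i) : acc
    end
  end

  # Hamming distance: the number of bits in which two fingerprints differ.
  def self.distance(a, b)
    (a ^ b).to_s(2).count('1')
  end
end
```

Two versions of the same paper (say, text extracted from a preprint and from the final PDF) would each be passed through SimHashSketch.fingerprint, and a small SimHashSketch.distance between the results would suggest a match. The cutoff for "small" is something you would have to tune on real documents.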
You can use pHash for this kind of job.
And this gem will help you to get started:
require 'phash/text'
# `%` compares the two files and returns their similarity
Phash::Text.new('first.txt') % Phash::Text.new('second.txt')