Given subject image's hash "100111..10" we want to find all similar image hashes in Elasticsearch index within hamming distance of 8.
Of course, query can return images with greater distance than 8 and script in Elasticsearch or outside can filter the result set. But total search time must be within 1 second or so.
Each document has nested images
field that contains image hashes:
{
"images": {
"type": "nested",
"properties": {
"pHashFingerprint": {"index": "not_analysed", "type": "string"}
}
}
}
Fact: Elasticsearch fuzzy query supports Levenshtein distance of max 2 only.
We used custom tokenizer to split 64 bit string into 4 groups of 16 bits and do 4 group search with four fuzzy queries.
Analyzer:
{
"analysis": {
"analyzer": {
"split4_fingerprint_analyzer": {
"type": "custom",
"tokenizer": "split4_fingerprint_tokenizer"
}
},
"tokenizer": {
"split4_fingerprint_tokenizer": {
"type": "pattern",
"group": 0,
"pattern": "([01]{16})"
}
}
}
}
Then new field mapping:
"index_analyzer": "split4_fingerprint_analyzer",
Then query:
{
"query": {
"filtered": {
"query": {
"nested": {
"path": "images",
"query": {
"bool": {
"minimum_should_match": 2,
"should": [
{
"fuzzy": {
"phashFingerprint.split4": {
"value": "0010100100111001",
"fuzziness": 2
}
}
},
{
"fuzzy": {
"phashFingerprint.split4": {
"value": "1010100100111001",
"fuzziness": 2
}
}
},
{
"fuzzy": {
"phashFingerprint.split4": {
"value": "0110100100111001",
"fuzziness": 2
}
}
},
{
"fuzzy": {
"phashFingerprint.split4": {
"value": "1110100100111001",
"fuzziness": 2
}
}
}
]
}
}
}
},
"filter": {}
}
}
}
Note that we return documents that have matching images, not the images themselves, but that should not change things a lot.
The problem is that this query returns hundreds of thousands of results even after adding other domain-specific filters to reduce initial set. Script has too much work to calculate hamming distance again, therefore query can take minutes.
As expected, if increasing minimum_should_match
to 3 and 4, only subset of images that must be found are returned, but resulting set is small and fast. Below 95% of needed images are returned with minimum_should_match
== 3 but we need 100% (or 99.9%) like with minimum_should_match
== 2.
We tried similar approaches with n-grams, but still not much success in the similar fashion of too many results.
Any solutions of other data structures and queries?
Edit:
We noticed, that there was a bug in our evaluation process, and minimum_should_match
== 2 returns 100% of results. However, processing time afterwards takes on average 5 seconds. We will see if script is worth optimising.
I also implemented the CUDA approach with some good results even on a laptop GeForce 650M graphics card. Implementation was easy with Thrust library. I hope the code doesn't have bugs (I didn't thoroughly test it) but it shouldn't affect benchmark results. At least I called thrust::system::cuda::detail::synchronize()
before stopping the high-precision timer.
typedef unsigned __int32 uint32_t;
typedef unsigned __int64 uint64_t;
// Maybe there is a simple 64-bit solution out there?
__host__ __device__ inline int hammingWeight(uint32_t v)
{
v = v - ((v>>1) & 0x55555555);
v = (v & 0x33333333) + ((v>>2) & 0x33333333);
return ((v + (v>>4) & 0xF0F0F0F) * 0x1010101) >> 24;
}
__host__ __device__ inline int hammingDistance(const uint64_t a, const uint64_t b)
{
const uint64_t delta = a ^ b;
return hammingWeight(delta & 0xffffffffULL) + hammingWeight(delta >> 32);
}
struct HammingDistanceFilter
{
const uint64_t _target, _maxDistance;
HammingDistanceFilter(const uint64_t target, const uint64_t maxDistance) :
_target(target), _maxDistance(maxDistance) {
}
__host__ __device__ bool operator()(const uint64_t hash) {
return hammingDistance(_target, hash) <= _maxDistance;
}
};
Linear searching was as easy as
thrust::copy_if(
hashesGpu.cbegin(), hashesGpu.cend(), matchesGpu.begin(),
HammingDistanceFilter(target_hash, maxDistance)
)
Searching was 100% accurate and way faster than my ElasticSearch answer, in 50 milliseconds CUDA could stream through 35 million hashes! I'm sure newer desktop cards are even way faster than this. Also we get very low variance and consistent linear growth of search time as we go through more and more data. ElasticSearch hit bad memory problems on larger queries due to inflated sampling data.
So here I'm reporting results of "From these N hashes, find those which are within 8 Hamming distance from a single hash H". I run these 500 times and reported percentiles.
There is some kernel launch overhead but after the search space is more than 5 million hashes the searching speed is fairly consistent at 700 million hashes / second. Naturally the upper bound on number of hashes to be searched is set by GPU's RAM.
Update: I re-run my tests on GTX 1060 and it scans about 3800 million hashes per second :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With