I have the following table storing data about images:
images
- id (int)
- sample_1_1 (int)
- sample_1_2 (int)
- sample_1_3 (int)
- sample_2_1 (int)
- sample_2_2 (int)
- sample_2_3 (int)
- ... # Up until sample_25_3
The task is to calcuate the distance between the collected data. Currently, I am using a 75-dimensional (that's right, 3 * 25 = 75) euclid distance calculation programmed as stored procedures into the database:
CREATE DEFINER=`root`@`localhost`
FUNCTION `distanceBetween`(compareId INT, toId INT) RETURNS double
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE distance DOUBLE;
SELECT euclidDistance(
i1.sample_1_1, i1.sample_1_2, i1.sample_1_3,
i2.sample_1_1, i2.sample_1_2, i2.sample_1_3,
...
) INTO distance
FROM images i1, (SELECT * FROM images WHERE id = toId) i2
WHERE i1.id = compareId;
RETURN distance;
END
With another subroutine calculating the actual distance between 2 75-dim. vectors:
CREATE DEFINER=`root`@`localhost`
FUNCTION `euclidDistance`(
img1_sample1_1 INT, img1_sample1_2 INT, img1_sample1_3 INT,
img2_sample1_1 INT, img2_sample1_2 INT, img2_sample1_3 INT,
...
) RETURNS double
RETURN SQRT(
quadDiff(img1_sample1_1, img2_sample1_1)
+ quadDiff(img1_sample1_2, img2_sample1_2)
+ quadDiff(img1_sample1_3, img2_sample1_3)
+ ...
)
And another subroutine to calculate the squared difference between two values:
CREATE DEFINER=`root`@`localhost`
FUNCTION `quadDiff`(var1 INT, var2 INT) RETURNS int(11)
RETURN POW(var1 - var2, 2)
The functions itself are perfectly fine and return deterministic results that are both mathematically and logically correct.
The problem comes when I want to get the "closest" images to a given image - meaning the images that have the least distance to any given image. To do so, I use another procedure:
CREATE DEFINER=`root`@`localhost`
PROCEDURE `getSimilarImages`(imageId INT, `limit` INT)
BEGIN
SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance
FROM images i1, (SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2
WHERE i1.id = imageId
ORDER BY distance
LIMIT 10;
END
The database currently features around 30.000 images. This means that a CALL getSimilarImages(123, 10);
takes about 12 seconds to finish. This is way too long for any application, be it web or application based.
So, I want to speed things up. What are my options? Do you see any potential to optimize the process of image comparing or calculating the distances?
I thought about caching the result of the procedure, but I have no clue how to do so. I could also compare every image to every other image as soon as new images are added, but that will adding images a very long process, which is also not acceptable.
I can provide more information on the system setup if it helps, but I appreciate any pointers you can provide. The current situation is not good and I really need to do something, because the image database will only grow with every hour the system is up.
As you've discovered, your ORDER BY distance(a,b) operation is computing a WHOLE lot of those 75-dimensional distances, and unsurprisingly, it's taking a long time. It has to compute the whole lot so it can do the ORDER operation. Ouch.
Here's an observation about Euclidean distance that may help you: the bounding box is a useful approximation. When you're using GetSimilarImages, can you include only images that are within a particular threshold distance of the imageId you're using?
Let's say you only care about images within rad
distance of your imageId. Then you could rework your GetSimilarImages query like this.
PROCEDURE `getSimilarImages`(imageId INT, `limit` INT, rad INT)
BEGIN
SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance
FROM images i1,
(SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2
WHERE i1.id = imageId
AND i1.img_sample_1_1 BETWEEN i2.img_sample_1_1 - rad
AND i2.img_sample_1_1 + rad
AND i1.img_sample_1_2 BETWEEN i2.img_sample_1_2 - rad
AND i2.img_sample_1_2 + rad
AND i1.img_sample_1_3 BETWEEN i2.img_sample_1_3 - rad
AND i2.img_sample_1_3 + rad
ORDER BY distance
LIMIT 10;
END
In this example code I have arbitrarily chosen the first three of your 75 dimensions to use for the bounding-box inclusion code (the three BETWEEN
clauses). For this optimization to work, you'll need to create indexes on at least some of the dimensions used for bounding-box inclusion.
There's nothing special about choosing three dimensions, or about choosing any particular dimensions. If you know from your data that certain of your dimensions discriminate better between your images, those are the ones to choose. You could choose as many dimensions as you need. But, of course there's indexing overhead.
Code a UDF C function or call a native C function that calls a a GPU function.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With