Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up multi-dimensional euclid distance calculation in MySQL

I have the following table storing data about images:

images
 - id (int)
 - sample_1_1 (int)
 - sample_1_2 (int)
 - sample_1_3 (int)
 - sample_2_1 (int)
 - sample_2_2 (int)
 - sample_2_3 (int)
 - ... # Up until sample_25_3

The task is to calcuate the distance between the collected data. Currently, I am using a 75-dimensional (that's right, 3 * 25 = 75) euclid distance calculation programmed as stored procedures into the database:

CREATE DEFINER=`root`@`localhost`
FUNCTION `distanceBetween`(compareId INT, toId INT) RETURNS double
    READS SQL DATA
    DETERMINISTIC
BEGIN
 DECLARE distance DOUBLE;
SELECT euclidDistance(
 i1.sample_1_1, i1.sample_1_2, i1.sample_1_3,
 i2.sample_1_1, i2.sample_1_2, i2.sample_1_3,
 ...
 ) INTO distance
FROM images i1, (SELECT * FROM images WHERE id = toId) i2
WHERE i1.id = compareId;
RETURN distance;
END

With another subroutine calculating the actual distance between 2 75-dim. vectors:

CREATE DEFINER=`root`@`localhost`
    FUNCTION `euclidDistance`(
 img1_sample1_1 INT, img1_sample1_2 INT, img1_sample1_3 INT,
 img2_sample1_1 INT, img2_sample1_2 INT, img2_sample1_3 INT,
 ...
 ) RETURNS double
RETURN SQRT(
   quadDiff(img1_sample1_1, img2_sample1_1)
 + quadDiff(img1_sample1_2, img2_sample1_2)
 + quadDiff(img1_sample1_3, img2_sample1_3)
 + ...
)

And another subroutine to calculate the squared difference between two values:

CREATE DEFINER=`root`@`localhost`
FUNCTION `quadDiff`(var1 INT, var2 INT) RETURNS int(11)
RETURN POW(var1 - var2, 2)

The functions itself are perfectly fine and return deterministic results that are both mathematically and logically correct.

The problem comes when I want to get the "closest" images to a given image - meaning the images that have the least distance to any given image. To do so, I use another procedure:

CREATE DEFINER=`root`@`localhost`
PROCEDURE `getSimilarImages`(imageId INT, `limit` INT)
BEGIN
    SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance
    FROM images i1, (SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2
    WHERE i1.id = imageId
    ORDER BY distance
    LIMIT 10;
END

The database currently features around 30.000 images. This means that a CALL getSimilarImages(123, 10); takes about 12 seconds to finish. This is way too long for any application, be it web or application based.

So, I want to speed things up. What are my options? Do you see any potential to optimize the process of image comparing or calculating the distances?

I thought about caching the result of the procedure, but I have no clue how to do so. I could also compare every image to every other image as soon as new images are added, but that will adding images a very long process, which is also not acceptable.

I can provide more information on the system setup if it helps, but I appreciate any pointers you can provide. The current situation is not good and I really need to do something, because the image database will only grow with every hour the system is up.

like image 514
F.P Avatar asked Aug 03 '12 09:08

F.P


2 Answers

As you've discovered, your ORDER BY distance(a,b) operation is computing a WHOLE lot of those 75-dimensional distances, and unsurprisingly, it's taking a long time. It has to compute the whole lot so it can do the ORDER operation. Ouch.

Here's an observation about Euclidean distance that may help you: the bounding box is a useful approximation. When you're using GetSimilarImages, can you include only images that are within a particular threshold distance of the imageId you're using?

Let's say you only care about images within rad distance of your imageId. Then you could rework your GetSimilarImages query like this.

PROCEDURE `getSimilarImages`(imageId INT, `limit` INT, rad INT)
BEGIN
    SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance
    FROM images i1, 
    (SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2
    WHERE i1.id = imageId
      AND i1.img_sample_1_1 BETWEEN i2.img_sample_1_1 - rad
                                AND i2.img_sample_1_1 + rad
      AND i1.img_sample_1_2 BETWEEN i2.img_sample_1_2 - rad
                                AND i2.img_sample_1_2 + rad
      AND i1.img_sample_1_3 BETWEEN i2.img_sample_1_3 - rad
                                AND i2.img_sample_1_3 + rad
    ORDER BY distance
    LIMIT 10;
END

In this example code I have arbitrarily chosen the first three of your 75 dimensions to use for the bounding-box inclusion code (the three BETWEEN clauses). For this optimization to work, you'll need to create indexes on at least some of the dimensions used for bounding-box inclusion.

There's nothing special about choosing three dimensions, or about choosing any particular dimensions. If you know from your data that certain of your dimensions discriminate better between your images, those are the ones to choose. You could choose as many dimensions as you need. But, of course there's indexing overhead.

like image 145
O. Jones Avatar answered Nov 04 '22 00:11

O. Jones


Code a UDF C function or call a native C function that calls a a GPU function.

like image 36
Tim Child Avatar answered Nov 04 '22 00:11

Tim Child