Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance of Delta E (CIE Lab) calculating and sorting in SQL

I have a database table where each row is a color. My goal: given an input color, calculate its distance to each color in the DB table, and sort the results by that distance. Or, stated as a user story: when I choose a color, I want to see a list of the colors that are most similar to the one that I picked, with the closest matches at the top of the list.

I understand that, in order to do this, the various Delta E (CIE Lab) formulae are the best choice. I wasn't able to find any native SQL implementations of the formulae, so I wrote my own SQL versions of Delta E CIE 1976 and Delta E CIE 2000. I verified the accuracy of my SQL versions of the formulae, against the results generated by the python-colormath implementations.

The 1976 formula is easy to write, in SQL or in any other language, because it's a simple Euclidean distance calculation. It performs nice and fast for me, on datasets of any size (tested it on a color table with 100,000 rows, and the query takes less than 1 second).

The 2000 formula, in contrast, is very long and complex. I managed to implement it in SQL, but its performance is not great: about 5 seconds to query 10,000 rows, and about 1 minute to query 100,000 rows.

I wrote an example app called colorsearchtest (in Python / Flask / Postgres), to play around with my implementations (and I set up a demo on Heroku). If you try out this app, you can clearly see the performance difference between the 1976 and the 2000 Delta E queries.

This is the schema for the color table (for each color, it stores the respective RGB and Lab representations, as three numeric values):

CREATE TABLE color (
    id integer NOT NULL,
    rgb_r integer,
    rgb_g integer,
    rgb_b integer,
    lab_l double precision,
    lab_a double precision,
    lab_b double precision
);

This is some data in the table (all just random colors, generated by a script in my app):

INSERT INTO color (id, rgb_r, rgb_g, rgb_b, lab_l, lab_a, lab_b)
VALUES (902, 164, 214, 189, 81.6521019943304793,
        -21.2561872439361323, 7.08354581694699004);

INSERT INTO color (id, rgb_r, rgb_g, rgb_b, lab_l, lab_a, lab_b)
VALUES (903, 113, 229, 64, 81.7930860963098212,
        -60.5865728472875205, 66.4022741184551819);

INSERT INTO color (id, rgb_r, rgb_g, rgb_b, lab_l, lab_a, lab_b)
VALUES (904, 65, 86, 78, 34.6593864327796624,
        -9.95482220634028003, 2.02661293272071719);

...

And this is the Delta E CIE 2000 SQL function that I'm using:

CREATE OR REPLACE FUNCTION
DELTA_E_CIE2000(double precision, double precision,
                double precision, double precision,
                double precision, double precision,
                double precision, double precision,
                double precision)
RETURNS double precision
AS $$

WITH
    c AS (SELECT
            (CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))
        AS lab_pair_str,
            (($1 + $4) /
                2.0)
        AS avg_lp,
            SQRT(
                POW($2, 2.0) +
                POW($3, 2.0))
        AS c1,
            SQRT(
                POW(($5), 2.0) +
                POW(($6), 2.0))
        AS c2),
    gs AS (SELECT
            c.lab_pair_str,
            (0.5 *
                (1.0 - SQRT(
                    POW(((c.c1 + c.c2) / 2.0), 7.0) / (
                        POW(((c.c1 + c.c2) / 2.0), 7.0) +
                        POW(25.0, 7.0)))))
        AS g
        FROM c
        WHERE c.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    ap AS (SELECT
            gs.lab_pair_str,
            ((1.0 + gs.g) * $2)
        AS a1p,
            ((1.0 + gs.g) * $5)
        AS a2p
        FROM gs
        WHERE gs.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    cphp AS (SELECT
            ap.lab_pair_str,
            SQRT(
                POW(ap.a1p, 2.0) +
                POW($3, 2.0))
        AS c1p,
            SQRT(
                POW(ap.a2p, 2.0) +
                POW($6, 2.0))
        AS c2p,
            (
                DEGREES(ATAN2($3, ap.a1p)) + (
                    CASE
                        WHEN DEGREES(ATAN2($3, ap.a1p)) < 0.0
                        THEN 360.0
                        ELSE 0.0
                        END))
        AS h1p,
            (
                DEGREES(ATAN2($6, ap.a2p)) + (
                    CASE
                        WHEN DEGREES(ATAN2($6, ap.a2p)) < 0.0
                        THEN 360.0
                        ELSE 0.0
                        END))
        AS h2p
        FROM ap
        WHERE ap.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    av AS (SELECT
            cphp.lab_pair_str,
            ((cphp.c1p + cphp.c2p) /
                2.0)
        AS avg_c1p_c2p,
            (((CASE
                WHEN (ABS(cphp.h1p - cphp.h2p) > 180.0)
                THEN 360.0
                ELSE 0.0
                END) +
              cphp.h1p +
              cphp.h2p) /
                2.0)
        AS avg_hp
        FROM cphp
        WHERE cphp.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    ts AS (SELECT
            av.lab_pair_str,
            (1.0 -
                0.17 * COS(RADIANS(av.avg_hp - 30.0)) +
                0.24 * COS(RADIANS(2.0 * av.avg_hp)) +
                0.32 * COS(RADIANS(3.0 * av.avg_hp + 6.0)) -
                0.2 * COS(RADIANS(4.0 * av.avg_hp - 63.0)))
        AS t,
            ((
                    (cphp.h2p - cphp.h1p) +
                    (CASE
                        WHEN (ABS(cphp.h2p - cphp.h1p) > 180.0)
                        THEN 360.0
                        ELSE 0.0
                        END))
                -
                (CASE
                    WHEN (cphp.h2p > cphp.h1p)
                    THEN 720.0
                    ELSE 0.0
                    END))
        AS delta_hlp
        FROM av
        INNER JOIN cphp
        ON av.lab_pair_str = cphp.lab_pair_str
        WHERE av.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    d AS (SELECT
            ts.lab_pair_str,
            ($4 - $1)
        AS delta_lp,
            (cphp.c2p - cphp.c1p)
        AS delta_cp,
            (2.0 * (
                SQRT(cphp.c2p * cphp.c1p) *
                SIN(RADIANS(ts.delta_hlp) / 2.0)))
        AS delta_hp,
            (1.0 + (
                (0.015 * POW(c.avg_lp - 50.0, 2.0)) /
                SQRT(20.0 + POW(c.avg_lp - 50.0, 2.0))))
        AS s_l,
            (1.0 + 0.045 * av.avg_c1p_c2p)
        AS s_c,
            (1.0 + 0.015 * av.avg_c1p_c2p * ts.t)
        AS s_h,
            (30.0 * EXP(-(POW(((av.avg_hp - 275.0) / 25.0), 2.0))))
        AS delta_ro,
            SQRT(
                (POW(av.avg_c1p_c2p, 7.0)) /
                (POW(av.avg_c1p_c2p, 7.0) + POW(25.0, 7.0)))
        AS r_c
        FROM ts
        INNER JOIN cphp
        ON ts.lab_pair_str = cphp.lab_pair_str
        INNER JOIN c
        ON ts.lab_pair_str = c.lab_pair_str
        INNER JOIN av
        ON ts.lab_pair_str = av.lab_pair_str
        WHERE ts.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    r AS (SELECT
            d.lab_pair_str,
            (-2.0 * d.r_c * SIN(2.0 * RADIANS(d.delta_ro)))
        AS r_t
        FROM d
        WHERE d.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR)))
SELECT
        SQRT(
            POW(d.delta_lp / (d.s_l * $7), 2.0) +
            POW(d.delta_cp / (d.s_c * $8), 2.0) +
            POW(d.delta_hp / (d.s_h * $9), 2.0) +
            r.r_t *
            (d.delta_cp / (d.s_c * $8)) *
            (d.delta_hp / (d.s_h * $9)))
    AS delta_e_cie2000
FROM          r
INNER JOIN    d
ON            r.lab_pair_str = d.lab_pair_str
WHERE         r.lab_pair_str = (
          CAST($1 AS VARCHAR) || ',' ||
          CAST($2 AS VARCHAR) || ',' ||
          CAST($3 AS VARCHAR) || ',' ||
          CAST($4 AS VARCHAR) || ',' ||
          CAST($5 AS VARCHAR) || ',' ||
          CAST($6 AS VARCHAR))

$$

LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT;

(I originally wrote this function using nested subqueries about 10 levels deep, but I then re-wrote it to instead use WITH statements, i.e. Postgres CTEs. The new version is much more readable, and performance is similar to the old version. You can see both versions in the code.)

After defining the function, I use it in a query like this:

SELECT        c.rgb_r,
              c.rgb_g,
              c.rgb_b,
        DELTA_E_CIE2000(73.9206633504, -50.2996953437,
                        23.8259166281,
                        c.lab_l, c.lab_a, c.lab_b,
                        1.0, 1.0, 1.0)
    AS de2000
FROM          color c
ORDER BY      de2000
LIMIT         100;

So, my question: is there any way that I could improve the performance of the DELTA_E_CIE2000 function, to make it usable in real-time for non-trivial data sets? Or, considering the complexity of the formula, is that as fast as it's going to get?

From the testing that I've done in my demo app, I'd say that for the use case of a simple "similar colors" search on a web site, the difference in results accuracy between the 1976 and 2000 functions is actually negligible. That is, I'm already confident that for my needs, the 1976 formula is "good enough". However, the 2000 function does return slightly better results (depending very much on where the input color lies in the Lab space), and really, I'm just curious as to whether it could be sped up further.

like image 405
Jaza Avatar asked Aug 04 '15 00:08

Jaza


1 Answers

Two things: 1) you are not using the database to its full extent and 2) your problem is a great example for a custom PostgreSQL extension. Here's why.

You are only using database as storage, storing colors as floats. In your current configuration, regardless of the type of query, the database will always have to check all values (make a sequential scan). This means a lot of IO and a lot of calculation for few returned matches. You are trying to find the nearest N colors, so there are a few possibilities on how to avoid performing calculations on all data.

Simple improvement

Simplest is to limit your calculations to a smaller subset of data. You can assume the difference will be greater if the components differ more. If you can find a safe difference between the components, where the results are always inappropriate, you can exclude those colors altogether using ranged WHERE with btree indexes. However, due to nature of L*a*b colorspace, this will likely worsen your results.

First create the indexes:

CREATE INDEX color_lab_l_btree ON color USING btree (lab_l);
CREATE INDEX color_lab_a_btree ON color USING btree (lab_a);
CREATE INDEX color_lab_b_btree ON color USING btree (lab_b);

Then I adapted your query to include a WHERE clause to filter only colors, where any of the components differs for at most 20.

Update: After another look, adding a limit of 20 will very likely worsen the results, since I found at least one point in space, for which this holds true.:

SELECT 
    c.rgb_r, c.rgb_g, c.rgb_b,
    DELTA_E_CIE2000(
        25.805780252087963, 53.33446637366859, -45.03961353720049, 
        c.lab_l, c.lab_a, c.lab_b,
        1.0, 1.0, 1.0) AS de2000
FROM color c 
WHERE 
    c.lab_l BETWEEN 25.805780252087963 - 20 AND 25.805780252087963 + 20 
    AND c.lab_a BETWEEN 53.33446637366859 - 20 AND 53.33446637366859 + 20 
    AND c.lab_b BETWEEN -45.03961353720049 - 20 AND -45.03961353720049 + 20 
ORDER BY de2000 ;

I filled the table with 100000 of random colors with your script and tested:

Time without indexes: 44006,851 ms

Time with indexes and range query: 1293,092 ms

You can add this WHERE clause to delta_e_cie1976_query too, on my random data it drops the query time from ~110 ms to ~22 ms.

BTW: I got the number 20 empirically: I tried with 10, but got only 380 records, which seems a little low and might exclude some better options since the limit is 100. With 20 the full set was 2900 rows and one can be fairly sure that the closest matches will be there. I didn't study the DELTA_E_CIE2000 or L*a*b* color space in detail so the threshold may need adjustment along different components for that to actually be true, but the principle of excluding non-interesting data holds.

Rewrite Delta E CIE 2000 in C

As you've already said, Delta E CIE 2000 is complex and fairly unsuitable for implementing in SQL. It currently uses about 0.4 ms per call on my laptop. Implementing it in C should considerably speed this up. PostgreSQL assigns default cost to SQL functions as 100 and C functions as 1. I'm guessing this is based on real experience.

Update: Since this also scratches one of my itches, I reimplemented the Delta E functions from colormath module in C as a PostgreSQL extension, available on PGXN. With this I can see a speedup of about 150x for CIE2000 when querying all the records from the table with 100k records.

With this C function, I get query times between 147 ms and 160 ms for 100k colors. With extra WHERE, query time is about 20 ms, which seems quite acceptable for me.

Best, but advanced solution

However, since your problem is N nearest neighbor search in 3-dimensional space, you could use K-Nearest-Neighbor Indexing which is in PostgreSQL since version 9.1.

For that to work, you'd put L*a*b* components into a cube. This extension does not yet support distance operator (it's in the works), but even if it would, it would not support Delta E distances and you would need to reimplement it as a C extension.

This means implementing GiST index operator class (btree_gist PostgreSQL extension in contrib does this) to support indexing according to Delta E distances. The good part is you could then use different operators for different versions of Delta E, eg. <-> for Delta E CIE 2000 and <#> for Delta E CIE 1976 and queries would be really really fast for small LIMIT even with Delta E CIE 2000.

In the end it may depend on what your (business) requirements and constraints are.

like image 81
hruske Avatar answered Oct 11 '22 05:10

hruske