Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get minimum Euclidean distance between a given vector and vectors in the database

I store 128D vectors in PostgreSQL table as double precision []:

create table tab (
   id integer,
   name character varying (200),
   vector double precision []
 )

For a given vector, I need to return one record from the database with the minimum Euclidean distance between this vector and the vector in the table entry.

I have a function that computes the Euclidean distance of two vectors according to the known formula SQRT ((v11-v21) ^ 2 + (v1 [2] -v2 [2]) ^ 2 + .... + (v1 [128] -v2 [128] ]) ^ 2):

CREATE OR REPLACE FUNCTION public.euclidian (
  arr1 double precision [],
  arr2 double precision [])
  RETURNS double precision AS
$ BODY $
  select sqrt (SUM (tab.v)) as euclidian from (SELECT
     UNNEST (vec_sub (arr1, arr2)) as v) as tab;
$ BODY $
LANGUAGE sql IMMUTABLE STRICT

Ancillary function:

CREATE OR REPLACE FUNCTION public.vec_sub (
  arr1 double precision [],
  arr2 double precision [])
RETURNS double precision [] AS
$ BODY $
  SELECT array_agg (result)
    FROM (SELECT (tuple.val1 - tuple.val2) * (tuple.val1 - tuple.val2)
        AS result
        FROM (SELECT UNNEST ($ 1) AS val1
               , UNNEST ($ 2) AS val2
               , generate_subscripts ($ 1, 1) AS ix) tuple
    ORDER BY ix) inn;
$ BODY $
LANGUAGE sql IMMUTABLE STRICT

Query:

select tab.id as tabid, tab.name as tabname,
        euclidian ('{0.1,0.2,...,0.128}', tab.vector) as eucl from tab
order by eulc ASC
limit 1

Everything works fine since I have several thousands of records in tab. But the DB is going to be grown and I need to avoid full scan of tab running the query, add a kind of index search. Would be great to filter-out at least 80% of records by index, the remaining 20% can be handled by full scan.

One of the current directions of the solution search: PostGIS extension allows to search and sort by distance (ST_3DDistance), filter by distance (ST_3DWithin), etc. This works great and fast using indicies. Is it possible to abstract for N-dimensional space?

Some observations:

  • all coordinate values are between [-0.5...0.5] (I do not know exactly, I think [-1.0 ...1.0] are theoretical limits)

  • the vectors are not normalized, the distance from (0,0,... 0) is in range [1.2...1.6].

That is the translated post from StackExchange Russian.

like image 909
Eugene Bartosh Avatar asked Dec 29 '17 08:12

Eugene Bartosh


People also ask

How do you find the Euclidean distance between two vectors?

Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.

Is Euclidean distance shortest?

Euclidean distance is the shortest distance between two points in an N dimensional space also known as Euclidean space. It is used as a common metric to measure the similarity between two data points and used in various fields such as geometry, data mining, deep learning and others.


1 Answers

Like @SaiBot hints at with local sensitivity hashing (LSH), there are plenty of researched techniques that allow you to run approximate nearest neighbors (ANN) searches. You have to accept a speed/accuracy tradeoff, but this is reasonable for most production scenarios since a brute-force approach to find the exact neighbors tends to be computationally prohibitive.

This article is an excellent overview of current state-of-the-art algorithms along with their pros and cons. Below, I've linked several popular open-source implementations. All 3 have Python bindings

  • Facebook FAISS
  • Spotify Annoy
  • Google ScaNN
like image 126
Addison Klinke Avatar answered Oct 24 '22 02:10

Addison Klinke