manydimentional distance in postgres

Question

Postgres / Postgis support distance calculations between points, sorting by distance from a given point and so on, I'm trying to abuse this facility :)

This is an example, just to illustrate what I'm trying to do.

       Width    Height  Thickness   Hue Lightness   Saturation     
Item1  220      157     100         270 60          75     
Item2  221      199     105         280 60          75     
Item3  210      150     105         100 40          75

Ignoring these capabilities for now, to find close matching items from the table below that are similar in physical dimensions you could do:

SELECT * FROM items
ORDER BY sqrt((i1.width-i2.height)*(i1.width-i2.height)
            + (i1.width-i2.height)*(i1.width-i2.height)
            + (i1.thickness-i2.thickness)*(i1.thickness-i2.thickness))

(call that pseudosql :) )

We can consider the 4th-6th properties to be dimensions such that a similar (or "close") color has a similar value for each of those properties. And we can extend the above query by adding in those fields.

Now, Postgres seems to be able to do some pretty great things with indexing and such that would speed up the above query massively, but the data types only seem to go up to 4 dimensions.

What would be a good solution for calculating, and more often sorting records similarity of 10 scales (or "dimensions") for millions of records?

I'm currently thinking of 3 4D points and sorting by the distance of the 3 points.

Without the Postgis thing I guess the question would be:-

In a table with items as follows, what would be the fastest way to sort by the sum of the difference between 10 values input (one per val field), taking advantage of whatever possible, bounding boxes, spacial or other indexes etc.

eg.

SELECT * FROM items
ORDER BY ((item_val1-xxxx)*(item_val1-xxxx)
        * (item_val2-yyyy)*(item_val2-yyyy)
        * ......
        * (item_val10-zzzz)*(item_val10-zzzz))

where xxxx,yyyy, zzzz represent values input in the front end application

item_name: string
item_val1: int
item_val2: int
item_val3: int
item_val4: int
item_val5: int
item_val6: int
item_val7: int
item_val8: int
item_val9: int
item_val10: int

Any ideas / alternatives greatly appreciated.

Mike T · Accepted Answer

PostGIS is great for GIS, but as John B pointed out, the cube extension is well tailored to work with n-dimensional Eucludian distances.

Add the extension, and add a new column with GiST index to the table:

CREATE EXTENSION cube;
ALTER TABLE items ADD COLUMN point cube;
CREATE INDEX items_point_idx ON items USING gist (point);

Populate the point field, and you may want to weight some attributes higher than others by multiplying the column by a scalar (2 for twice, 0.5 for half, etc.):

UPDATE items SET
  point = cube(array[width, height, thickness, hue, lightness, saturation]);

Now find some distances using a self-join:

SELECT i1.id, i1.point, i2.id, i2.point, cube_distance(i1.point, i2.point)
FROM items i1, items i2
WHERE i1.id < i2.id
ORDER BY cube_distance(i1.point, i2.point);

manydimentional distance in postgres

Tags:

postgresql

postgis

Anthony Graham

1 Answers

Mike T

Recent Activity

Donate For Us

manydimentional distance in postgres

Tags:

postgresql

postgis

Anthony Graham

1 Answers

Mike T

Related questions

Recent Activity

Donate For Us