Postgres / Postgis support distance calculations between points, sorting by distance from a given point and so on, I'm trying to abuse this facility :)
This is an example, just to illustrate what I'm trying to do.
Width Height Thickness Hue Lightness Saturation
Item1 220 157 100 270 60 75
Item2 221 199 105 280 60 75
Item3 210 150 105 100 40 75
Ignoring these capabilities for now, to find close matching items from the table below that are similar in physical dimensions you could do:
SELECT * FROM items
ORDER BY sqrt((i1.width-i2.height)*(i1.width-i2.height)
+ (i1.width-i2.height)*(i1.width-i2.height)
+ (i1.thickness-i2.thickness)*(i1.thickness-i2.thickness))
(call that pseudosql :) )
We can consider the 4th-6th properties to be dimensions such that a similar (or "close") color has a similar value for each of those properties. And we can extend the above query by adding in those fields.
Now, Postgres seems to be able to do some pretty great things with indexing and such that would speed up the above query massively, but the data types only seem to go up to 4 dimensions.
What would be a good solution for calculating, and more often sorting records similarity of 10 scales (or "dimensions") for millions of records?
I'm currently thinking of 3 4D points and sorting by the distance of the 3 points.
Without the Postgis thing I guess the question would be:-
In a table with items as follows, what would be the fastest way to sort by the sum of the difference between 10 values input (one per val field), taking advantage of whatever possible, bounding boxes, spacial or other indexes etc.
eg.
SELECT * FROM items
ORDER BY ((item_val1-xxxx)*(item_val1-xxxx)
* (item_val2-yyyy)*(item_val2-yyyy)
* ......
* (item_val10-zzzz)*(item_val10-zzzz))
where xxxx,yyyy, zzzz represent values input in the front end application
item_name: string
item_val1: int
item_val2: int
item_val3: int
item_val4: int
item_val5: int
item_val6: int
item_val7: int
item_val8: int
item_val9: int
item_val10: int
Any ideas / alternatives greatly appreciated.
PostGIS is great for GIS, but as John B pointed out, the cube extension is well tailored to work with n-dimensional Eucludian distances.
Add the extension, and add a new column with GiST index to the table:
CREATE EXTENSION cube;
ALTER TABLE items ADD COLUMN point cube;
CREATE INDEX items_point_idx ON items USING gist (point);
Populate the point field, and you may want to weight some attributes higher than others by multiplying the column by a scalar (2 for twice, 0.5 for half, etc.):
UPDATE items SET
point = cube(array[width, height, thickness, hue, lightness, saturation]);
Now find some distances using a self-join:
SELECT i1.id, i1.point, i2.id, i2.point, cube_distance(i1.point, i2.point)
FROM items i1, items i2
WHERE i1.id < i2.id
ORDER BY cube_distance(i1.point, i2.point);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With