
Indexing and querying high-dimensional data in PostgreSQL

Tags:

sql

multidimensional-array

indexing

postgresql

postgresql-9.3

I want to index data in high dimensions (128-dimensional vectors of integers in the range [0,254] are possible):

| id |      vector       |
|  1 | { 1, 0, ..., 254} |
|  2 | { 2, 128, ...,1}  |
|  . | { 1, 0, ..., 252} |
|  n | { 1, 2, ..., 251} |

I saw that PostGIS implemented R-Trees. So can I use these trees in PostGIS to index and query multidimensional vectors in Postgres?

I also saw that there is an index implementation for int arrays.

Now I have questions about how to perform a query.
Can I perform a knn-search and a radius search on an integer array? Maybe I also must define my own distance function. Is this possible? I want to use the Manhattan distance (block distance) for my queries.

I can also represent my vector as a binary string with the pattern v1;v2;...;vn. Does this help to perform the search?

For example, if I had these two strings:

1;2;1;1
1;3;2;2

The result / distance between these two strings should be 3.


asked Feb 15 '16 09:02

501 - not implemented

2 Answers

Perhaps a better choice would be the cube extension, since your area of interest is not individual integers but the full vector.

Cube supports GiST indexing, and Postgres 9.6 will also bring KNN indexing to cubes, supporting Euclidean, taxicab (aka Manhattan) and Chebyshev distances.
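As a quick check of what the three distance operators compute, here is a minimal sketch using the two vectors from the question. It assumes a cube version that already ships the distance operators (Postgres 9.6, or 9.5 with the backported patch):

```sql
-- The two example vectors from the question: 1;2;1;1 and 1;3;2;2
select cube(array[1,2,1,1]) <#> cube(array[1,3,2,2]) as taxicab,    -- 0 + 1 + 1 + 1 = 3
       cube(array[1,2,1,1]) <-> cube(array[1,3,2,2]) as euclidean,  -- sqrt(0 + 1 + 1 + 1), about 1.732
       cube(array[1,2,1,1]) <=> cube(array[1,3,2,2]) as chebyshev;  -- max(0, 1, 1, 1) = 1
```

The taxicab result matches the distance of 3 that the question expects, so no custom distance function should be needed for that metric.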

It is a bit annoying that 9.6 is still in development; however, backporting the cube extension patch to 9.5 is no problem, and I say that from experience.

Hopefully the index will still give meaningful results at 128 dimensions.

How to do this?

First have an example table:

create extension cube;
create table vectors (id serial, vector cube);

Populate table with example data:

insert into vectors select id, cube(ARRAY[round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000)]) from generate_series(1, 2000000) id;
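For real data rather than random values, the input only needs to be turned into a float8 array before it is passed to cube(). A small sketch, assuming the semicolon-separated string form and the integer arrays described in the question:

```sql
-- String form from the question (v1;v2;...;vn) converted to a cube
select cube(string_to_array('1;2;1;1', ';')::float8[]);

-- Integer array converted to a cube
select cube(array[1,3,2,2]::int[]::float8[]);
```

The same expressions can be used in an insert ... select that reads from wherever the source vectors are stored.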

Then try selecting:

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=123352.07..123352.09 rows=10 width=76) (actual time=1705.499..1705.501 rows=10 loops=1)
   ->  Sort  (cost=123352.07..129852.07 rows=2600000 width=76) (actual time=1705.496..1705.497 rows=10 loops=1)
         Sort Key: (('(966, 82, 765, 343, 600, 718, 338, 505)'::cube <#> vector))
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Seq Scan on vectors  (cost=0.00..67167.00 rows=2600000 width=76) (actual time=0.038..998.864 rows=2600000 loops=1)
 Planning time: 0.172 ms
 Execution time: 1705.541 ms
(7 rows)

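We should create a GiST index (without "using gist", Postgres would build a btree index, which cannot serve distance ordering):

create index vectors_vector_idx on vectors using gist (vector);

Does it help?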

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;

--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..1.93 rows=10 width=76) (actual time=41.339..143.915 rows=10 loops=1)
   ->  Index Scan using vectors_vector_idx on vectors  (cost=0.41..393704.41 rows=2600000 width=76) (actual time=41.336..143.902 rows=10 loops=1)
         Order By: (vector <#> '(966, 82, 765, 343, 600, 718, 338, 505)'::cube)
 Planning time: 0.146 ms
 Execution time: 145.474 ms
(5 rows)

At 8 dimensions, it does help.
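The question also asks about radius search, which the queries above do not cover. One possible approach, sketched here as an assumption rather than a tested recipe, is to pre-filter with a bounding box and then apply the exact distance check. Any point within taxicab distance r of the query point differs from it by at most r in each coordinate, so it lies inside the box built by cube_enlarge(). The radius of 300 is an arbitrary value chosen for illustration:

```sql
-- Radius search sketch: rows within taxicab distance 300 of the query point.
-- cube_enlarge() extends the query point by 300 in each of the 8 dimensions;
-- the containment test (<@) can use the GiST index, and the second condition
-- applies the exact taxicab distance to the remaining candidates.
select id, vector,
       cube(array[966,82,765,343,600,718,338,505]) <#> vector as dist
from vectors
where vector <@ cube_enlarge(cube(array[966,82,765,343,600,718,338,505]), 300, 8)
  and cube(array[966,82,765,343,600,718,338,505]) <#> vector <= 300
order by dist;
```

Note that the bounding box becomes a much looser approximation of the taxicab ball as the number of dimensions grows, so the pre-filter is likely to be less selective at 128 dimensions than at 8.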


answered Oct 19 '22 23:10

hruske

(Addendum to selected answer)

For people wanting more than 100 dimensions, beware: the cube extension has a limit of 100 dimensions.

The tricky part is that Postgres lets you create cubes with more than 100 dimensions just fine. They are only refused when you try to restore a backup (the worst time to find out).
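One way to spot affected rows before taking a dump is to check the dimensionality with cube_dim(). A small sketch against the vectors table from the selected answer, using the default limit of 100 as the threshold:

```sql
-- List rows whose cube exceeds the default 100-dimension limit
select id, cube_dim(vector) as dims
from vectors
where cube_dim(vector) > 100;
```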

As recommended in the documentation, I patched the cube extension to support more dimensions. I made a Docker image for it, and you can look at the Dockerfile in the GitHub repo to see how to do it yourself.

answered Oct 19 '22 23:10

kik
