Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to index a string array column for pg_trgm `'term' % ANY (array_column)` query?

I have tried an ordinary Postgres gin index as well as the pg_trgm gin_trgm_ops and gist_trgm_ops indexes (using this workaround: https://stackoverflow.com/a/33016333/283398).

However EXPLAIN on my query 'term' % ANY (array_column) shows a sequential scan even after having executed set enable_seqscan = off;.

(For my use case, I need partial matches and pg_trgm seems like a much better fit than full-text search because my data is not linguistic. The quality of my pg_trgm results is very good.)

My use case is rows with an array column containing a mix of first names and full names (space-delimited). The search term may be a first, last, or full name (space-delimited). The pg_trgm % operator results are case insensitive and appear to weight highly matches at the beginning and end of the names in the array column, which is great for full names because it finds matching first and last names but not necessarily middle names.

https://github.com/theirix/parray_gin is promising, but old, and doesn't claim to support Postgres newer than 9.2.

like image 419
Gabe Kopley Avatar asked Sep 13 '16 23:09

Gabe Kopley


2 Answers

Why this does not work

The index type (i.e. operator class) gin_trgm_ops is based on % operator, which works on two text arguments:

CREATE OPERATOR trgm.%(
  PROCEDURE = trgm.similarity_op,
  LEFTARG = text,
  RIGHTARG = text,
  COMMUTATOR = %,
  RESTRICT = contsel,
  JOIN = contjoinsel);

You cannot use gin_trgm_ops for arrays. An index defined for an array column will never work with any(array[...]) because individual elements of arrays are not indexed. Indexing an array would require a different type of index, namely gin array index.

Fortunately, the index gin_trgm_ops has been so cleverly designed that it is working with operators like and ilike, which can be used as an alternative solution (example described below).

Test table

has two columns (id serial primary key, names text[]) and contains 100000 latin sentences split into array elements.

select count(*), sum(cardinality(names))::int words from test;

 count  |  words  
--------+---------
 100000 | 1799389

select * from test limit 1;

 id |                                                     names                                                     
----+---------------------------------------------------------------------------------------------------------------
  1 | {fugiat,odio,aut,quis,dolorem,exercitationem,fugiat,voluptates,facere,error,debitis,ut,nam,et,voluptatem,eum}

Searching for the word fragment praesent gives 7051 rows in 2400 ms:

explain analyse
select count(*)
from test
where 'praesent' % any(names);

                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5479.49..5479.50 rows=1 width=0) (actual time=2400.866..2400.866 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..5477.00 rows=996 width=0) (actual time=1.464..2400.271 rows=7051 loops=1)
         Filter: ('praesent'::text % ANY (names))
         Rows Removed by Filter: 92949
 Planning time: 1.038 ms
 Execution time: 2400.916 ms

Materialized view

One solution is to normalize the model, involving the creation of a new table with a single name in one row. Such restructuring may be difficult to implement and sometimes impossible due to existing queries, views, functions, or other dependencies. A similar effect can be achieved without changing table structure, using a materialized view.

create materialized view test_names as
    select id, name, name_id
    from test
    cross join unnest(names) with ordinality u(name, name_id)
    with data;

With ordinality is not necessary, but can be useful when aggregating the names in the same order as in the main table. Querying test_names gives the same results as the main table in the same time.

After creating the index execution time decreases repeatedly:

create index on test_names using gin (name gin_trgm_ops);

explain analyse
select count(distinct id)
from test_names
where 'praesent' % name

                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4888.89..4888.90 rows=1 width=4) (actual time=56.045..56.045 rows=1 loops=1)
   ->  Bitmap Heap Scan on test_names  (cost=141.95..4884.39 rows=1799 width=4) (actual time=10.513..54.987 rows=7230 loops=1)
         Recheck Cond: ('praesent'::text % name)
         Rows Removed by Index Recheck: 7219
         Heap Blocks: exact=8122
         ->  Bitmap Index Scan on test_names_name_idx  (cost=0.00..141.50 rows=1799 width=0) (actual time=9.512..9.512 rows=14449 loops=1)
               Index Cond: ('praesent'::text % name)
 Planning time: 2.990 ms
 Execution time: 56.521 ms

The solution has a few drawbacks. Because the view is materialized, the data is stored twice in the database. You have to remember to refresh the view after changes to the main table. And queries may be more complicated because of the need to join the view to the main table.

Using ilike

We can use ilike on the arrays represented as text. We need an immutable function to create the index on the array as a whole:

create function text(text[])
returns text language sql immutable as
$$ select $1::text $$

create index on test using gin (text(names) gin_trgm_ops);

and use the function in queries:

explain analyse
select count(*)
from test
where text(names) ilike '%praesent%' 

                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=117.06..117.07 rows=1 width=0) (actual time=60.585..60.585 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=76.08..117.03 rows=10 width=0) (actual time=2.560..60.161 rows=7051 loops=1)
         Recheck Cond: (text(names) ~~* '%praesent%'::text)
         Heap Blocks: exact=2899
         ->  Bitmap Index Scan on test_text_idx  (cost=0.00..76.08 rows=10 width=0) (actual time=2.160..2.160 rows=7051 loops=1)
               Index Cond: (text(names) ~~* '%praesent%'::text)
 Planning time: 3.301 ms
 Execution time: 60.876 ms

60 versus 2400 ms, quite nice result without the need of creating additional relations.

This solution seems simpler and requiring less work, provided however that ilike, which is less precise tool than the trgm % operator, is sufficient.

Why should we use ilike rather than % for whole arrays as text? The similarity depends largely on the length of the texts. It is very difficult to choose an appropriate limit for the search a word in long texts of various length. E.g. with limit = 0.3 we have the results:

with data(txt) as (
values
    ('praesentium,distinctio,modi,nulla,commodi,tempore'),
    ('praesentium,distinctio,modi,nulla,commodi'),
    ('praesentium,distinctio,modi,nulla'),
    ('praesentium,distinctio,modi'),
    ('praesentium,distinctio'),
    ('praesentium')
)
select length(txt), similarity('praesent', txt), 'praesent' % txt "matched?"
from data;

 length | similarity | matched? 
--------+------------+----------
     49 |   0.166667 | f           <--!
     41 |        0.2 | f           <--!
     33 |   0.228571 | f           <--!
     27 |   0.275862 | f           <--!
     22 |   0.333333 | t
     11 |   0.615385 | t
(6 rows)
like image 156
klin Avatar answered Oct 22 '22 12:10

klin


I created a test table and a function called f that only converts to text.

CREATE OR REPLACE FUNCTION getNArray(el text[], count int) RETURNS text[] AS $$
  SELECT array_agg(el[random()*(array_length(el,1)-1)+1]) FROM generate_series(1,count) g(i)
$$
VOLATILE
LANGUAGE SQL;

DROP TABLE testGin;
CREATE TABLE testGin(id serial PRIMARY KEY, array_column text[]);

WITH t(ray) AS(
  SELECT (string_to_array(pg_read_file('words.list')::text,E'\n')) 
) 
INSERT INTO testGin(array_column)
SELECT getNArray(T.ray, 4) FROM T, generate_series(1,100000);

The cast function:

CREATE OR REPLACE FUNCTION f(arr text[]) RETURNS text AS $$
   SELECT arr::text
 LANGUAGE SQL IMMUTABLE;

CREATE INDEX ON testGin USING GIN(f(array_column) gin_trgm_ops);

The usage with ILIKE:

postgres=# EXPLAIN SELECT id FROM testgin WHERE f(array_column) ilike '%test%';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Bitmap Heap Scan on testgin  (cost=34.82..1669.63 rows=880 width=4)
   Recheck Cond: (f(array_column) ~~* '%test%'::text)
   ->  Bitmap Index Scan on testgin_f_idx  (cost=0.00..34.60 rows=880 width=0)
         Index Cond: (f(array_column) ~~* '%test%'::text)
(4 rows)

If you want a more accurate search by including the % operator, you can do as bellow. This will scan the index and then, it will apply your filter:

postgres=# explain SELECT id, array_column FROM testgin WHERE 'response' % ANY (array_column) and f(array_column) ~ 'response';
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Bitmap Heap Scan on testgin  (cost=76.08..120.38 rows=1 width=85)
   Recheck Cond: (f(array_column) ~ 'response'::text)
   Filter: ('response'::text % ANY (array_column))
   ->  Bitmap Index Scan on testgin_f_idx  (cost=0.00..76.08 rows=11 width=0)
         Index Cond: (f(array_column) ~ 'response'::text)
(5 rows)
like image 32
3manuek Avatar answered Oct 22 '22 14:10

3manuek