Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to use PostgreSQL full text search ranking

Following on from this answer I want to know what the best way to use PostgreSQL's built-in full text search is if I want to sort by rank, and limit to only matching queries.

Let's assume a very simple table.

CREATE TABLE pictures (
  id SERIAL PRIMARY KEY,
  title varchar(300),
  ...
)

or whatever. Now I want to search the title field. First I create an index:

CREATE INDEX pictures_title ON pictures 
  USING gin(to_tsvector('english', title));

Now I want to search for 'small dog'. This works:

SELECT pictures.id, 
       ts_rank_cd(
         to_tsvector('english', pictures.title), 'small dog'
       ) AS score
FROM pictures
ORDER BY score DESC

But what I really want is this:

SELECT pictures.id, 
       ts_rank_cd(
         to_tsvector('english', pictures.title), to_tsquery('small dog')
       ) AS score
FROM pictures 
WHERE to_tsvector('english', pictures.title) @@ to_tsquery('small dog')
ORDER BY score DESC

Or alternatively this (which doesn't work - can't use score in the WHERE clause):

SELECT pictures.id, 
       ts_rank_cd(
         to_tsvector('english', pictures.title), to_tsquery('small dog')
       ) AS score
FROM pictures WHERE score > 0
ORDER BY score DESC

What's the best way to do this? My questions are many-fold:

  1. If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?
  2. Is there a way to do it without repeating the to_ts... function calls?
  3. Is there a way to use score in the WHERE clause at all?
  4. If there is, would it be better to filter by score > 0 or use the @@ thing?
like image 550
Timmmm Avatar asked Oct 17 '12 11:10

Timmmm


People also ask

Is PostgreSQL good for full text search?

Yes, You Can Keep Full-Text Search in Postgres You can get even deeper and make your Postgres full-text search even more robust, by implementing features such as highlighting results, or writing your own custom dictionaries or functions.

What is FTS in PostgreSQL?

Understanding Postgres Full Text Search: 10 Critical Aspects. Sarad on Database, PostgreSQL, Tutorial • August 28th, 2017 • Write for Hevo. Setting up PostgreSQL Full Text Search (FTS) helps users bring up a fully-fledged search engine without a steep learning curve and new software to install.

Is Elasticsearch faster than Postgres?

And the more size you want to search in, the more Elasticsearch is better than PostgreSQL in performance. Additionally, you could also get many benefits and great performance if you pre-process the posts into several fields and indexes well before storing into Elasticsearch.

How do I find text in PostgreSQL function?

Text search by using to_tsvector function and operator We have search text by using the to_tsvector function in PostgreSQL. In to_tsvector, “ts” is defined as text search. In to_tsvector, the tsvector is the data type of to_tsvector function. This function will return the lexeme tokens with pointers in PostgreSQL.


2 Answers

The use of the @@ operator will utilize the full text GIN index, while the test for score > 0 would not.

I created a table as in the Question, but added a column named title_tsv:

CREATE TABLE test_pictures (
  id        BIGSERIAL,
  title     text,
  title_tsv tsvector
);

CREATE INDEX ix_pictures_title_tsv ON test_pictures 
    USING gin(title_tsv);

I populated the table with some test data:

INSERT INTO test_pictures(title, title_tsv)
SELECT T.data, to_tsvector(T.data) 
FROM   some_table T;

Then I ran the previously accepted answer with explain analyze:

EXPLAIN ANALYZE 
SELECT  score, id, title
FROM (
    SELECT ts_rank_cd(P.title_tsv, to_tsquery('address & shipping')) AS score
        ,P.id        
        ,P.title
    FROM test_pictures as P
) S
WHERE score > 0
ORDER BY score DESC;

And got the following. Note the execution time of 5,015 ms

QUERY PLAN                                                                                                                                    |
----------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge  (cost=274895.48..323298.03 rows=414850 width=60) (actual time=5010.844..5011.330 rows=1477 loops=1)                             |
  Workers Planned: 2                                                                                                                          |
  Workers Launched: 2                                                                                                                         |
  ->  Sort  (cost=273895.46..274414.02 rows=207425 width=60) (actual time=4994.539..4994.555 rows=492 loops=3)                                |
        Sort Key: (ts_rank_cd(p.title_tsv, to_tsquery('address & shipping'::text))) DESC                                                      |
        Sort Method: quicksort  Memory: 131kB                                                                                                 |
        ->  Parallel Seq Scan on test_pictures p  (cost=0.00..247776.02 rows=207425 width=60) (actual time=17.672..4993.997 rows=492 loops=3) |
              Filter: (ts_rank_cd(title_tsv, to_tsquery('address & shipping'::text)) > '0'::double precision)                                 |
              Rows Removed by Filter: 497296                                                                                                  |
Planning time: 0.159 ms                                                                                                                       |
Execution time: 5015.664 ms                                                                                                                   |

Now compare that with the @@ operator:

EXPLAIN ANALYZE
SELECT ts_rank_cd(to_tsvector(P.title), to_tsquery('address & shipping')) AS score
    ,P.id
    ,P.title
FROM    test_pictures as P
WHERE P.title_tsv @@ to_tsquery('address & shipping')
ORDER BY score DESC;

And the results coming in with an execution time of about 29 ms:

QUERY PLAN                                                                                                                                       |
-------------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge  (cost=13884.42..14288.35 rows=3462 width=60) (actual time=26.472..26.942 rows=1477 loops=1)                                        |
  Workers Planned: 2                                                                                                                             |
  Workers Launched: 2                                                                                                                            |
  ->  Sort  (cost=12884.40..12888.73 rows=1731 width=60) (actual time=17.507..17.524 rows=492 loops=3)                                           |
        Sort Key: (ts_rank_cd(to_tsvector(title), to_tsquery('address & shipping'::text))) DESC                                                  |
        Sort Method: quicksort  Memory: 171kB                                                                                                    |
        ->  Parallel Bitmap Heap Scan on test_pictures p  (cost=72.45..12791.29 rows=1731 width=60) (actual time=1.781..17.268 rows=492 loops=3) |
              Recheck Cond: (title_tsv @@ to_tsquery('address & shipping'::text))                                                                |
              Heap Blocks: exact=625                                                                                                             |
              ->  Bitmap Index Scan on ix_pictures_title_tsv  (cost=0.00..71.41 rows=4155 width=0) (actual time=3.765..3.765 rows=1477 loops=1)  |
                    Index Cond: (title_tsv @@ to_tsquery('address & shipping'::text))                                                            |
Planning time: 0.214 ms                                                                                                                          |
Execution time: 28.995 ms                                                                                                                        |

As you can see in the execution plan, the index ix_pictures_title_tsv was used in the second query, but not in the first one, making the query with the @@ operator a whopping 172 times faster!

like image 153
isapir Avatar answered Sep 20 '22 12:09

isapir


select *
from (
    SELECT
        pictures.id,
        ts_rank_cd(to_tsvector('english', pictures.title), 
        to_tsquery('small dog')) AS score
    FROM pictures
) s
WHERE score > 0
ORDER BY score DESC
like image 33
Clodoaldo Neto Avatar answered Sep 19 '22 12:09

Clodoaldo Neto