PostgreSQL full text search performance not acceptable when ordering by ts_rank_cd

Tags: sql, full-text-search, postgresql

In my PostgreSQL 9.3 database, I have a table called articles. It looks kind of like this:

+------------+--------------------------------------------------------------+
|    Name    |                        Information                           |
+------------+--------------------------------------------------------------+
| id         | Auto-increment integer ID                                    |
| title      | text                                                         |
| category   | character varying(255) with index                            |
| keywords   | String with title and extra words used for indexing          |
| tsv        | Trigger updates w/ tsvector_update_trigger based on keywords |
+------------+--------------------------------------------------------------+

There are more columns in the table, but I don't think they are crucial to the question. The table is 94GB in total and has about 29M rows.

I'm trying to run a keyword search on a subset of 23M of the article rows. To do this I use the following query:

SELECT title, id FROM articles, plainto_tsquery('dog') AS q 
WHERE (tsv @@ q) AND category = 'animal' 
ORDER BY ts_rank_cd(tsv, q) DESC LIMIT 5

The problem is that it appears to run ts_rank_cd on every matching row before it can sort them, so the query is very slow: about 2-3 minutes. I've read around a lot to try and find a solution, and it was suggested that I wrap the search query in another query so that the ranking is only applied to the found results, like so:

SELECT * FROM (
  SELECT title, id, tsv FROM articles, plainto_tsquery('dog') AS q 
  WHERE (tsv @@ q) AND category = 'animal'
) AS t1
ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('dog')) DESC LIMIT 5;

However, because the search term is so short, it matches 450K rows in the subset, so it still takes a long time. It might be a bit quicker, but I need this to be essentially instant.

The question: Is there anything I can do to keep this searching functionality within PostgreSQL?

It's nice having this logic kept in the database, and it means I don't need any extra servers or configuration for something like Solr or Elasticsearch. For example, would increasing the database instance capacity help? Or would that not be cost-efficient compared with moving this logic to a dedicated Elasticsearch instance?

The EXPLAIN response from the first query is as follows:

Limit  (cost=567539.41..567539.42 rows=5 width=465)
  ->  Sort  (cost=567539.41..567853.33 rows=125568 width=465)
        Sort Key: (ts_rank_cd(articles.tsv, q.q))
        ->  Nested Loop  (cost=1769.27..565453.77 rows=125568 width=465)
              ->  Function Scan on plainto_tsquery q  (cost=0.00..0.01 rows=1 width=32)
              ->  Bitmap Heap Scan on articles  (cost=1769.27..563884.17 rows=125567 width=433)
                    Recheck Cond: (tsv @@ q.q)
                    Filter: ((category)::text = 'animal'::text)
                    ->  Bitmap Index Scan on article_search_idx  (cost=0.00..1737.87 rows=163983 width=0)
                          Index Cond: (tsv @@ q.q)

And for the second query:

Aggregate  (cost=565453.77..565453.78 rows=1 width=0)
  ->  Nested Loop  (cost=1769.27..565139.85 rows=125568 width=0)
        ->  Function Scan on plainto_tsquery q  (cost=0.00..0.01 rows=1 width=32)
        ->  Bitmap Heap Scan on articles  (cost=1769.27..563884.17 rows=125567 width=351)
              Recheck Cond: (tsv @@ q.q)
              Filter: ((category)::text = 'animal'::text)
              ->  Bitmap Index Scan on article_search_idx  (cost=0.00..1737.87 rows=163983 width=0)
                    Index Cond: (tsv @@ q.q)

asked Jul 30 '14 09:07

betamax

1 Answer

You simply can't use an index for ts_rank_cd, because the ranking value it returns depends on your query. Therefore the rank must be computed for every row in the result set each time you run a query, before the result set can be sorted and limited by this value.
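
As a small illustration of why the rank cannot be stored ahead of time, the sketch below ranks one sample document against two different queries. The sample text and the 'english' configuration are assumptions chosen for the example, not taken from the question; the two returned values will generally differ because ts_rank_cd depends on both arguments.

```sql
-- Rank the same tsvector against two different tsqueries.
-- The sample sentence and the 'english' configuration are illustrative only.
SELECT
    ts_rank_cd(doc, plainto_tsquery('english', 'dog'))     AS rank_for_dog,
    ts_rank_cd(doc, plainto_tsquery('english', 'dog cat')) AS rank_for_dog_cat
FROM (
    SELECT to_tsvector('english', 'the dog chased the cat') AS doc
) AS sample;
```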

If your search logic allows it, you could avoid this bottleneck by precomputing a relevance value for each record once, creating an index on it, and using that as the sort column instead of the cover density computed for each query.
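
A minimal sketch of that idea, assuming a hypothetical relevance column that your application fills with a query-independent score (popularity, editorial weight, or similar). The column name, the index name, and how the score is calculated are all assumptions; only articles, tsv, category, title and id come from the question.

```sql
-- Hypothetical column for a query-independent score. It is added without a
-- default because on 9.3 adding a column with a default rewrites the table.
ALTER TABLE articles ADD COLUMN relevance real;

-- Backfilling the score is application-specific and not shown here.

-- Composite index so rows of one category can be read in relevance order.
-- NULLS LAST keeps rows that have no score yet at the end.
CREATE INDEX articles_category_relevance_idx
    ON articles (category, relevance DESC NULLS LAST);

-- Sort by the stored value instead of calling ts_rank_cd on every match.
SELECT title, id
FROM articles, plainto_tsquery('dog') AS q
WHERE tsv @@ q
  AND category = 'animal'
ORDER BY relevance DESC NULLS LAST
LIMIT 5;
```

With an index like this the planner may be able to walk the rows in relevance order and stop after five matches, but check the plan with EXPLAIN on your own data. The trade-off is that the ordering no longer reflects how well each row matches the search terms.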

Even though you said you didn't want to, I suggest you look into a search engine that can work together with PostgreSQL, such as Sphinx. The default BM25 ranker should work fine. You can still set column weights as well, if you have to (http://sphinxsearch.com/docs/current.html#api-func-setfieldweights).

Update: This is also stated in the documentation:

"Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches."

See http://www.postgresql.org/docs/8.3/static/textsearch-controls.html

answered Oct 14 '22 05:10

benjist
