In my PostgreSQL 9.3 database, I have a table called articles. It looks kind of like this:
+------------+--------------------------------------------------------------+
| Column     | Description                                                  |
+------------+--------------------------------------------------------------+
| id         | Auto-increment integer ID                                    |
| title      | text                                                         |
| category   | character varying(255), with an index                        |
| keywords   | String with title and extra words used for indexing          |
| tsv        | Trigger updates w/ tsvector_update_trigger based on keywords |
+------------+--------------------------------------------------------------+
There are more columns in the table, but I don't think they're crucial to the question. The total size of the table is 94 GB, with about 29M rows.
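For reference, here is a simplified sketch of how I understand the relevant parts of the schema. The trigger and the category index are my guesses at a standard setup; article_search_idx is the index name that appears in the plans below, and I'm assuming it is a GIN index using the english configuration:

CREATE TABLE articles (
    id       serial PRIMARY KEY,
    title    text,
    category character varying(255),
    keywords text,
    tsv      tsvector
);

CREATE INDEX articles_category_idx ON articles (category);
CREATE INDEX article_search_idx ON articles USING gin (tsv);

-- Keeps tsv in sync with keywords on every INSERT/UPDATE
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', keywords);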
I'm trying to run a keyword search over a subset of 23M of the article rows. To do this I use the following query:
SELECT title, id FROM articles, plainto_tsquery('dog') AS q
WHERE (tsv @@ q) AND category = 'animal'
ORDER BY ts_rank_cd(tsv, q) DESC LIMIT 5
The problem is that this appears to run ts_rank_cd on every matching row before it can sort them, so the query is very slow, taking about 2-3 minutes. I've read around a lot trying to find a solution, and it was suggested that I wrap the search in an outer query so that the ranking is only applied to the rows that were found, like so:
SELECT * FROM (
SELECT title, id, tsv FROM articles, plainto_tsquery('dog') AS q
WHERE (tsv @@ q) AND category = 'animal'
) AS t1
ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('dog')) DESC LIMIT 5;
However, because the search term is so short and matches so broadly, there are 450K results in the subset, so this still takes a long time. It might be slightly quicker, but I need it to be essentially instant.
The question: Is there anything I can do to keep this searching functionality within PostgreSQL?
It's nice having this logic kept in the database, and it means I don't need any extra servers or configuration for something like Solr or Elasticsearch. For example, would increasing the database instance's capacity help? Or would that not be cost-effective compared to shifting this logic over to a dedicated Elasticsearch instance?
The EXPLAIN output for the first query is as follows:
Limit (cost=567539.41..567539.42 rows=5 width=465)
-> Sort (cost=567539.41..567853.33 rows=125568 width=465)
Sort Key: (ts_rank_cd(articles.tsv, q.q))
-> Nested Loop (cost=1769.27..565453.77 rows=125568 width=465)
-> Function Scan on plainto_tsquery q (cost=0.00..0.01 rows=1 width=32)
-> Bitmap Heap Scan on articles (cost=1769.27..563884.17 rows=125567 width=433)
Recheck Cond: (tsv @@ q.q)
Filter: ((category)::text = 'animal'::text)
-> Bitmap Index Scan on article_search_idx (cost=0.00..1737.87 rows=163983 width=0)
Index Cond: (tsv @@ q.q)
And for the second query:
Aggregate (cost=565453.77..565453.78 rows=1 width=0)
-> Nested Loop (cost=1769.27..565139.85 rows=125568 width=0)
-> Function Scan on plainto_tsquery q (cost=0.00..0.01 rows=1 width=32)
-> Bitmap Heap Scan on articles (cost=1769.27..563884.17 rows=125567 width=351)
Recheck Cond: (tsv @@ q.q)
Filter: ((category)::text = 'animal'::text)
-> Bitmap Index Scan on article_search_idx (cost=0.00..1737.87 rows=163983 width=0)
Index Cond: (tsv @@ q.q)
You simply can't use an index over ts_rank_cd, because the resulting ranking value depends on your query. Therefore, all rank values for the whole result set must be computed every time you run a query, before the result set can be sorted and limited by that value.
If your search logic allows it, you could avoid this bottleneck by precomputing a relevance value for each record once, creating an index on it, and using that as the sort column instead of the per-query cover density ranking, as sketched below.
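Here is a minimal sketch of that approach, assuming a static score makes sense for your data. The relevance column, its index name, and the length(tsv) formula are placeholders; substitute whatever query-independent signal fits, such as popularity or recency:

-- One-time setup: a precomputed, query-independent score
ALTER TABLE articles ADD COLUMN relevance real;

-- Purely illustrative formula: number of distinct lexemes in the document
UPDATE articles SET relevance = length(tsv);

CREATE INDEX articles_relevance_idx ON articles (relevance DESC);

-- At query time: filter via the tsv index, sort by the precomputed column
SELECT title, id
FROM articles, plainto_tsquery('dog') AS q
WHERE tsv @@ q AND category = 'animal'
ORDER BY relevance DESC
LIMIT 5;

The expensive part of ranking is fetching each matching tsvector from disk to score it; sorting the same matches by a plain numeric column avoids that entirely, even though Postgres still has to sort all ~450K matches.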
Even though you said you didn't want to, I suggest you look into a search engine that can work together with PostgreSQL, such as Sphinx. The default BM25 ranker should work fine. You can still set column weights as well, if you have to (http://sphinxsearch.com/docs/current.html#api-func-setfieldweights).
Update: This is also stated in the documentation:
"Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches."
See http://www.postgresql.org/docs/8.3/static/textsearch-controls.html