
Is there any reason to include a `tsvector` column in a postgres table rather than in the index?

I have a table with about 100 million rows and a text field that I'd like to search over. I've come up with two methods for doing this and I'd like to know the performance implications of each method.

Method 1: This is the method recommended by every blog post I've seen online (e.g. 1 and 2). The idea is to augment the table with a tsvector column and index the new column.
A simple example is:

CREATE TABLE articles (
    id_articles BIGSERIAL PRIMARY KEY,
    text TEXT,
    text_tsv TSVECTOR
);
CREATE INDEX articles_index ON articles USING gin(text_tsv);

and then a trigger is used to ensure that the text and text_tsv columns remain up-to-date.
This seems wasteful to me, however, as now the TSVECTOR information must be stored in both the table and the index, and the database is made much more complicated. So I've come up with a second method.
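
For reference, the trigger that Method 1 relies on could be sketched as follows. This is a minimal sketch, not the setup from any of the blog posts: the trigger name `articles_tsv_update` and the `pg_catalog.english` configuration are assumptions, and it uses the built-in `tsvector_update_trigger` function with the `EXECUTE PROCEDURE` syntax of PostgreSQL 10.

```sql
-- Keep text_tsv in sync with text on every insert and update.
-- Assumption: English text; the trigger name is hypothetical.
CREATE TRIGGER articles_tsv_update
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(text_tsv, 'pg_catalog.english', text);

-- Rows that existed before the trigger was created need a one-time backfill.
-- On a table with ~100 million rows this rewrites every row, so it is slow.
UPDATE articles
SET text_tsv = to_tsvector('pg_catalog.english', coalesce(text, ''));
```

Queries then match against the stored column, for example `WHERE text_tsv @@ to_tsquery('english', 'postgres & index')`.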

Method 2: My idea is to eliminate the extra column and change the index to include the to_tsvector function directly, like so:

CREATE TABLE articles (
    id_articles BIGSERIAL PRIMARY KEY,
    text TEXT
);
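-- An index expression must be IMMUTABLE, so the configuration has to be given
-- explicitly; 'english' is assumed here.
CREATE INDEX articles_index ON articles USING gin(to_tsvector('english', text));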

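With an expression index like this, the planner only uses the index when the query repeats the indexed expression exactly. A minimal sketch, assuming the index was built on the two-argument form `to_tsvector('english', text)` (the search terms are placeholders):

```sql
-- The left-hand side must match the index expression, including the
-- configuration name; otherwise the query falls back to a sequential scan.
SELECT id_articles
FROM articles
WHERE to_tsvector('english', text) @@ to_tsquery('english', 'postgres & (index | search)');
```
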
Question: Are there any downsides to using method 2 over method 1?

For my particular database, I've used the second method and I appear to get reasonable speedup for simple queries of a single word (search takes ~1 second). But when I have complex queries with several & and | operators in the to_tsquery function (and only ~10 matching results in the table), the search takes forever to run (many hours). If I switch to method 1, am I likely to see much faster query times for some reason?

If the slow performance of my queries is not due to my choice of method 2, is there anything else I might be able to do to speed up complex queries built with to_tsquery?

I'm using postgresql 10.10.

Mike Izbicki asked Mar 02 '23 22:03


1 Answer

The downside of not storing the tsvector is that PostgreSQL will have to recompute it from the raw text in order to "recheck" that the row meets the query. This can be very slow.

Rechecks are necessary if the size of the bitmap of candidate matches overflows work_mem. For some operators rechecks are always required, such as the phrase match operators <->, <2>, etc.
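
One way to check whether rechecks are the problem is to look at the plan of a slow query. The following is a sketch under assumptions: the `work_mem` value and the search terms are placeholders, and the query is written for an expression index on `to_tsvector('english', text)`. Note that `EXPLAIN ANALYZE` actually executes the query, so it takes as long as the query itself.

```sql
-- Raise work_mem for this session only, so the candidate bitmap is less
-- likely to become lossy. The value is an assumption; size it to your server.
SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT id_articles
FROM articles
WHERE to_tsvector('english', text) @@ to_tsquery('english', 'postgres & (index | search)');
```

In the Bitmap Heap Scan node, a nonzero `lossy=` count on the `Heap Blocks` line indicates that the bitmap overflowed `work_mem`, and `Rows Removed by Index Recheck` shows how many candidate rows were discarded by the recheck. Rows that pass the recheck still pay for the recomputation, so the cost can be higher than that counter suggests.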

jjanes answered Apr 06 '23 16:04