Dynamic or column-ized tsvector index?

Tags:

database

indexing

full-text-search

postgresql

database-design

I'm creating custom forum software for a site I'm building, which includes 2 tables (that are relevant to this question): topics and posts. A post belongs to a topic, and the topic contains the subject, while each post contains the body.

Here are the basic table structures, with the columns relevant to my question:

CREATE TABLE topics (
  id bigserial NOT NULL,
  title varchar(128) NOT NULL,
  created timestamp with time zone NOT NULL default NOW(),
  updated timestamp with time zone NOT NULL default NOW(),
  PRIMARY KEY (id)
);

CREATE TABLE posts (
  id bigserial NOT NULL,
  topic_id bigint NOT NULL REFERENCES topics(id) ON DELETE CASCADE,
  body text NOT NULL,
  created timestamp with time zone NOT NULL default NOW(),
  updated timestamp with time zone NOT NULL default NOW(),
  PRIMARY KEY (id)
);

Here are my two options on building fulltext indexes.

Option 1: Create dynamic tsvector indexes on title/body columns.

CREATE INDEX topics_title_idx ON topics USING gin(to_tsvector('english', title));
CREATE INDEX posts_body_idx ON posts USING gin(to_tsvector('english', body));

Option 2: Create extra columns to hold tsvector-ized title/body data, and add indexes on those.

-- NOT NULL without a default only works while the tables are still empty.
ALTER TABLE topics ADD COLUMN title_vector tsvector NOT NULL;
CREATE TRIGGER topics_ins BEFORE INSERT OR UPDATE ON topics FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(title_vector, 'pg_catalog.english', title);
CREATE INDEX topics_title_idx ON topics USING gin(title_vector);

ALTER TABLE posts ADD COLUMN body_vector tsvector NOT NULL;
CREATE TRIGGER posts_ins BEFORE INSERT OR UPDATE ON posts FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(body_vector, 'pg_catalog.english', body);
CREATE INDEX posts_body_idx ON posts USING gin(body_vector);

I'm debating between the two since option 1 will save me disk space, but provide slower searches, and option 2 will require additional disk space while providing faster searches.

Let's pretend there are 20 new topics & 100 new posts per day. Which would you choose? What if the number of topics/posts per day was twice that? Five times that? Ten times? Does your decision of one vs. the other change?

363

asked Oct 30 '09 02:10

Matt Huggins

2 Answers

Using Option 1 will not make your searches slower.

The GIN index will be used regardless of whether you created it on a stored column or on a computed expression.

You just need to change the query syntax:

SELECT  *
FROM    posts
WHERE   TO_TSVECTOR('english', body) @@ myquery

in the first case (the expression has to match the one used in the index definition), or

SELECT  *
FROM    posts
WHERE   body_vector @@ myquery

in the second case.

You can probably save a little time by running TS_RANK over the stored column, since the tsvector does not have to be recomputed for every row being ranked.
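One way to check this on your own data is to look at the query plans. The following is only a sketch: it assumes the Option 1 index was built with the two-argument to_tsvector('english', ...) form, that the Option 2 column is named body_vector, and the search terms are made up. On a very small table the planner may still prefer a sequential scan, so test with a realistic number of rows.

```sql
-- Option 1: the WHERE expression must match the index expression exactly.
EXPLAIN
SELECT id
FROM   posts
WHERE  to_tsvector('english', body) @@ to_tsquery('english', 'index & search');

-- Option 2: the same search against the stored column.
EXPLAIN
SELECT id
FROM   posts
WHERE  body_vector @@ to_tsquery('english', 'index & search');

-- Ranking: with a stored column, ts_rank reads the tsvector directly.
-- With Option 1 you would pass to_tsvector('english', body) instead,
-- which is recomputed for every matching row.
SELECT id, ts_rank(body_vector, q) AS rank
FROM   posts, to_tsquery('english', 'index & search') AS q
WHERE  body_vector @@ q
ORDER  BY rank DESC
LIMIT  20;
```

If both EXPLAIN outputs show a bitmap scan on posts_body_idx, the index is being used in both setups and the remaining difference is the cost of ranking.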

118

answered Oct 09 '22 22:10

Quassnoi

"Let's pretend there are 20 new topics & 100 new posts per day. Which would you choose? What if the number of topics/posts per day was twice that? Five times that? Ten times? Does your decision of one vs. the other change?"

That's roughly 36,500 posts a year (100 a day). At that volume the choice doesn't matter. It probably doesn't matter at ten times that either, even on a cheap machine.
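If you would rather measure the disk-space side than guess, the built-in size functions can be run against whichever option you set up. A minimal sketch, assuming the index name posts_body_idx from the question:

```sql
-- Size of the full-text index alone, and of the posts table
-- including all of its indexes and TOAST data.
SELECT pg_size_pretty(pg_relation_size('posts_body_idx'))  AS index_size,
       pg_size_pretty(pg_total_relation_size('posts'))     AS posts_total_size;
```

Running this once per option on a copy of the data gives actual numbers for the trade-off.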

However, you might want a third table containing an explicit tsvector that combines the topic title and the post body. You can then use the built-in weighting system and run a single search, which is the sort of search people generally expect on a forum. That will mean writing custom triggers to update the tsvector whenever either source table changes.
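A rough sketch of what that third table could look like follows. Everything named here (post_search, document, refresh_post_search and the trigger names) is made up for illustration, the 'A'/'D' weights are just one plausible choice, and the upsert assumes PostgreSQL 9.5 or later for ON CONFLICT.

```sql
-- Hypothetical search table: one row per post, holding a weighted tsvector.
CREATE TABLE post_search (
  post_id  bigint PRIMARY KEY REFERENCES posts(id) ON DELETE CASCADE,
  document tsvector NOT NULL
);
CREATE INDEX post_search_document_idx ON post_search USING gin(document);

-- Rebuild the search row for one post: title gets weight A, body weight D.
CREATE FUNCTION refresh_post_search(p_post_id bigint) RETURNS void AS $$
  INSERT INTO post_search (post_id, document)
  SELECT p.id,
         setweight(to_tsvector('english', t.title), 'A') ||
         setweight(to_tsvector('english', p.body),  'D')
  FROM   posts p
  JOIN   topics t ON t.id = p.topic_id
  WHERE  p.id = p_post_id
  ON CONFLICT (post_id) DO UPDATE SET document = EXCLUDED.document;
$$ LANGUAGE sql;

-- Keep the search row in sync when a post is added or edited.
CREATE FUNCTION posts_search_trg() RETURNS trigger AS $$
BEGIN
  PERFORM refresh_post_search(NEW.id);
  RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_search_sync
AFTER INSERT OR UPDATE OF body, topic_id ON posts
FOR EACH ROW EXECUTE PROCEDURE posts_search_trg();

-- When a topic title changes, refresh every post in that topic.
CREATE FUNCTION topics_search_trg() RETURNS trigger AS $$
BEGIN
  PERFORM refresh_post_search(p.id) FROM posts p WHERE p.topic_id = NEW.id;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER topics_search_sync
AFTER UPDATE OF title ON topics
FOR EACH ROW EXECUTE PROCEDURE topics_search_trg();

-- One search over titles and bodies; title matches rank higher.
SELECT post_id, ts_rank(document, q) AS rank
FROM   post_search, to_tsquery('english', 'index & search') AS q
WHERE  document @@ q
ORDER  BY rank DESC
LIMIT  20;
```

Deletes need no trigger in this sketch because the foreign key cascades from posts. Renaming a topic with many posts rewrites one search row per post, which is the price of searching a single table.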

answered Oct 09 '22 22:10

Richard Huxton
