Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why are PostgreSQL Text-Search GiST indexes so much slower than GIN indexes?

I'm testing out the PostgreSQL Text-Search features, using the September data dump from StackOverflow as sample data. :-)

The naive approach of using LIKE predicates or POSIX regular expression matching to search 1.2 million rows takes about 90-105 seconds (on my Macbook) to do a full table-scan searching for a keyword.

SELECT * FROM Posts WHERE body LIKE '%postgresql%';
SELECT * FROM Posts WHERE body ~ 'postgresql';

An unindexed, ad hoc text-search query takes about 8 minutes:

SELECT * FROM Posts WHERE to_tsvector(body) @@ to_tsquery('postgresql'); 

Creating a GIN index takes about 40 minutes:

ALTER TABLE Posts ADD COLUMN PostText TSVECTOR;
UPDATE Posts SET PostText = to_tsvector(body);
CREATE INDEX PostText_GIN ON Posts USING GIN(PostText);

(I realize I could also do this in one step by defining it as an expression index.)

Afterwards, a query assisted by a GIN index runs a lot faster -- this takes about 40 milliseconds:

SELECT * FROM Posts WHERE PostText @@ 'postgresql'; 

However, when I create a GiST index, the results are quite different. It takes less than 2 minutes to create the index:

CREATE INDEX PostText_GIN ON Posts USING GIST(PostText);

Afterwards, a query using the @@ text-search operator takes 90-100 seconds. So GiST indexes do improve an unindexed TS query from 8 minutes to 1.5 minutes. But that's no improvement over doing a full table-scan with LIKE. It's useless in a web programming environment.

Am I missing something crucial to using GiST indexes? Do the indexes need to be pre-cached in memory or something? I am using a plain PostgreSQL installation from MacPorts, with no tuning.

What is the recommended way to use GiST indexes? Or does everyone doing TS with PostgreSQL skip GiST indexes and use only GIN indexes?

PS: I do know about alternatives like Sphinx Search and Lucene. I'm just trying to learn about the features provided by PostgreSQL itself.

like image 861
Bill Karwin Avatar asked Oct 08 '09 20:10

Bill Karwin


People also ask

Which of the following is an advantage of the GiST over the gin index in PostgreSQL?

For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.

What is GiST index Postgres?

A GiST index is lossy, meaning that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed.) GiST indexes are lossy because each document is represented in the index by a fixed-length signature.

What is the most common index type used in PostgreSQL?

The most common and widely used index type is the B-tree index. This is the default index type for the CREATE INDEX command, unless you explicitly mention the type during index creation.


1 Answers

The docs have a nice overview of the performance differences between GiST and GIN indexes if you're interested: GiST and GIN Index Types.

like image 102
mattonrails Avatar answered Sep 23 '22 02:09

mattonrails