I'm testing out the PostgreSQL Text-Search features, using the September data dump from StackOverflow as sample data. :-)
The naive approach of using LIKE
predicates or POSIX regular expression matching to search 1.2 million rows takes about 90-105 seconds (on my Macbook) to do a full table-scan searching for a keyword.
SELECT * FROM Posts WHERE body LIKE '%postgresql%';
SELECT * FROM Posts WHERE body ~ 'postgresql';
An unindexed, ad hoc text-search query takes about 8 minutes:
SELECT * FROM Posts WHERE to_tsvector(body) @@ to_tsquery('postgresql');
Creating a GIN index takes about 40 minutes:
ALTER TABLE Posts ADD COLUMN PostText TSVECTOR;
UPDATE Posts SET PostText = to_tsvector(body);
CREATE INDEX PostText_GIN ON Posts USING GIN(PostText);
(I realize I could also do this in one step by defining it as an expression index.)
Afterwards, a query assisted by a GIN index runs a lot faster -- this takes about 40 milliseconds:
SELECT * FROM Posts WHERE PostText @@ 'postgresql';
However, when I create a GiST index, the results are quite different. It takes less than 2 minutes to create the index:
CREATE INDEX PostText_GIN ON Posts USING GIST(PostText);
Afterwards, a query using the @@
text-search operator takes 90-100 seconds. So GiST indexes do improve an unindexed TS query from 8 minutes to 1.5 minutes. But that's no improvement over doing a full table-scan with LIKE
. It's useless in a web programming environment.
Am I missing something crucial to using GiST indexes? Do the indexes need to be pre-cached in memory or something? I am using a plain PostgreSQL installation from MacPorts, with no tuning.
What is the recommended way to use GiST indexes? Or does everyone doing TS with PostgreSQL skip GiST indexes and use only GIN indexes?
PS: I do know about alternatives like Sphinx Search and Lucene. I'm just trying to learn about the features provided by PostgreSQL itself.
For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.
A GiST index is lossy, meaning that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed.) GiST indexes are lossy because each document is represented in the index by a fixed-length signature.
The most common and widely used index type is the B-tree index. This is the default index type for the CREATE INDEX command, unless you explicitly mention the type during index creation.
The docs have a nice overview of the performance differences between GiST and GIN indexes if you're interested: GiST and GIN Index Types.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With