Despite what all the documentation says, I'm finding GIN indexes to be significantly slower than GIST indexes for pg_trgm related searches. This is on a table of 25 million rows with a relatively short text field (average length of 21 characters). Most of the rows of text are addresses of the form "123 Main st, City".
GIST index takes about 4 seconds with a search like
select suggestion from search_suggestions where suggestion % 'seattle';
But GIN takes 90 seconds and the following result when running with EXPLAIN ANALYZE:
Bitmap Heap Scan on search_suggestions  (cost=330.09..73514.15 rows=25043 width=22) (actual time=671.606..86318.553 rows=40482 loops=1)
  Recheck Cond: ((suggestion)::text % 'seattle'::text)
  Rows Removed by Index Recheck: 23214341
  Heap Blocks: exact=7625 lossy=223807
  ->  Bitmap Index Scan on tri_suggestions_idx  (cost=0.00..323.83 rows=25043 width=0) (actual time=669.841..669.841 rows=1358175 loops=1)
        Index Cond: ((suggestion)::text % 'seattle'::text)
Planning time: 1.420 ms
Execution time: 86327.246 ms
Note that over a million rows are being selected by the index, even though only 40k rows actually match. Any ideas why this is performing so poorly? This is on PostgreSQL 9.4.
Some issues stand out:
First, consider upgrading to a current version of Postgres. At the time of writing, that's Postgres 9.6 or Postgres 10 (currently in beta). Since Postgres 9.4 there have been multiple improvements for GIN indexes, the additional module pg_trgm, and big data in general.
Next, you need much more RAM, in particular a higher work_mem setting. I can tell from this line in the EXPLAIN output:
Heap Blocks: exact=7625 lossy=223807
"lossy" in the details for a Bitmap Heap Scan (with your particular numbers) indicates a dramatic shortage of work_mem. Postgres only collects block addresses in the bitmap index scan instead of row pointers because that's expected to be faster with your low work_mem setting (can't hold exact addresses in RAM). Many more non-qualifying rows have to be filtered in the following Bitmap Heap Scan this way. This related answer has details:
But don't set work_mem too high without considering the whole situation:
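One low-risk way to test the effect is to raise work_mem for a single transaction only and compare the EXPLAIN output. This is a minimal sketch: the value '256MB' is an assumption chosen for illustration, not a recommendation for your server, and the right figure depends on available RAM and the number of concurrent sessions.

```sql
-- Show the current setting first
SHOW work_mem;

BEGIN;
-- SET LOCAL only lasts until the end of this transaction.
-- '256MB' is an assumed test value; adjust to your hardware.
SET LOCAL work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT suggestion
FROM   search_suggestions
WHERE  suggestion % 'seattle';
ROLLBACK;
```

If the setting is high enough, the "Heap Blocks" line should report fewer lossy blocks (ideally none), and "Rows Removed by Index Recheck" should drop accordingly.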
There may be other problems, like index or table bloat or more configuration bottlenecks. But if you fix just these two items, the query should be much faster already.
Also, do you really need to retrieve all 40k rows in the example? You probably want to add a small LIMIT to the query and make it a "nearest-neighbor" search, in which case a GiST index is the better choice after all, because nearest-neighbor searches are supposed to be faster with GiST. Example:
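A sketch of such a query, using the pg_trgm distance operator <-> to order by similarity. The index name search_suggestions_gist_idx and the LIMIT of 10 are assumptions for illustration; skip the CREATE INDEX if your existing GiST trigram index is still in place.

```sql
-- GiST trigram index (hypothetical name); required for an
-- index-supported ORDER BY ... <-> ... nearest-neighbor search
CREATE INDEX search_suggestions_gist_idx
ON search_suggestions USING gist (suggestion gist_trgm_ops);

-- Return only the 10 closest matches instead of all ~40k
SELECT suggestion
FROM   search_suggestions
WHERE  suggestion % 'seattle'
ORDER  BY suggestion <-> 'seattle'
LIMIT  10;
```

The WHERE clause keeps the similarity threshold filter, while ORDER BY with <-> lets the GiST index return rows in order of increasing trigram distance so the scan can stop after the first few matches.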